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Chapter  1 

Program  Overview 


This  program  presents  a  new  zdgorithm  for  the  multiprocessor  computation  of  the  multi¬ 
dimensional  discrete  Fourier  transform  (MD  DFT).  The  main  features  of  the  algorithm 
are:  (1)  that  it  requires  no  interprocessor  communication,  and  (2)  that  it  is  highly 
scalable.  The  cost  of  eliminating  interprocessor  communication  by  this  method  is  that 
one  addition  must  be  performed  at  each  processing  element  for  each  input  broadczust 
to  the  machine. 

The  motivation  for  this  research  program  hzis  been  the  development  of  algorithmic 
methods  that  can  exploit  the  power  of  VLSI  based  multiprocessors.  These  machines 
have  large  computational  capabilities  but  limited  commimication  bandwidth.  They 
strongly  favor  algorithms  that  minimize  interprocessor  commxmications.  This  principle 
of  locality  [32]  dominates  algorithm  design  at  all  levels. 

Multidimensional  Cooley-Tukey  algorithms,  and  their  veiriants,  require  interproces¬ 
sor  communication  because  they  partition  the  data  set  at  every  stage  of  the  compu¬ 
tation.  At  a  minimum  this  necessitates  an  interprocessor  communication  requirement 
where  every  processing  element  must  exchange  data  with  every  other  processing  ele¬ 
ment  to  complete  the  calculation. 

Most  parallel  algorithms  for  MD  DFT  computation  attempt  to  minimize  the  in¬ 
terprocessor  communications  requirement  inherent  in  Cooley-Tukey  methods.  This 
research  program  has  taken  a  new  approach.  The  edgorithm  presented  does  not  apply 
a  partition  to  the  input  data.  Rather,  it  performs  a  reduction  operation  in  parallel 
on  the  data.  This  reduction  requires  each  processing  element  to  perform  one  addition 
for  each  word  loaded  into  the  machine.  The  computation  is  completed  by  performing 
lower-order/lower-dimension  MD  DFT  on  the  data,  derived  by  the  additions,  at  each 
processing  element.  For  a  d— dimensional  DFT  each  processor  would  be  required  to 
compute  a  A:— dimensional  DFT  where  k  is  one  of  A:  =  0, 1, . . . ,  d  —  1  depending  on  the 
degree  of  parallelism  of  the  implementation.  The  algorithm  eliminates  the  requirement 
of  the  interconnection  network  that  typicedly  dominates  parallel  computation  of  the 
MD  DFT  at  a  cost  of  one  addition  at  each  processing  element  for  each  word  loaded 
into  the  machine. 

The  algorithm  is  based  on  a  parallelization  of  the  hyperplane  zdgorithm  for  MD 
DFT  computation  presented  in  chapter  2.  Two  modes  of  parallelization  are  considered. 
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The  first  is  a  direct  approach  given  in  chapter  3.  In  this  form  the  algorithm  requires 
no  interprocessor  communication,  however  it  is  not  scalable.  A  hybrid  approach  that 
marries  multidimensional  Cooley-Tukey  type  methods  with  the  hyperplane  algorithm 
is  presented  in  chapter  4.  The  resulting  hybrid  algorithm  requires  no  interprocessor 
coiTiinunication  and  is  highly  scalable. 


1.1  Applications 


Computations  that  requires  fast  execution  of  MD  DFT  are  well  suited  to  parallel  meth¬ 
ods.  Computational  techniques  based  on  conventional  methods  require  extensive  in¬ 
terprocessor  communication  which  in  turn  require  an  interconnect  network  between 
processing  elements.  Interconnect  networks  Eire  slow,  large,  expensive,  Eind  consume 
large  amounts  of  power. 

The  hardware  requirement  for  parallel  MD  DFT  computation  is  drastically  reduced 
by  the  elimination  of  the  communication  interconnection  network.  This  allows  low  cost, 
low  power,  and  small  size  machines  to  compute  MD  DFT  at  high  speed. 

In  order  to  exploit  the  method  developed  in  this  program,  the  data  fiow  of  the 
target  problem  must  match  that  of  the  algorithm.  One  such  problem  is  the  processing 
of  synthetic  aperture  radar  (SAR)  data.  SAR  is  an  airborne  radar  imaging  technique 
that  generates  high  resolution  terrain  maps  by  processing  radeir  returns  [20].  A  well 
known  method  of  SAR  image  reconstruction  interprets  the  SAR  data  as  samples  of 
the  Fourier  transform  of  the  targeted  terrain.  The  image  is  reconstructed  by  inverse 
transforming  the  sampled  Fourier  data.  The  realtime  computation  of  the  SAR  image 
using  this  technique  requires  the  realization  of  large  scale  2D  DFT  and  interpolations. 
Under  current  technology,  parallel  computation  is  required  to  achieve  high  resolution 
in  realtime  or  near  realtime.  Realtime  systems  based  on  conventional  digital  signal 
processing  techniques  require  extensive  interprocessor  commimication  to  compute  the 
2D  DFT.  The  required  communication  networks  greatly  increase  the  size,  cost,  md 
power  consumption  of  those  computing  structures. 

By  the  method  developed  in  this  research  program,  this  complex  structure  could 
be  replaced  by  a  multiprocessor  with  only  broadcast  communications.  The  Fourier 
domain  samples  could  be  broadcast  to  a  collection  of  processing  elements  where  they 
are  transformed  into  the  spatial  dommn  by  the  method  described  in  this  research.  After 
the  computation  is  completed,  the  image  data  would  be  distributed  in  the  machine  in 
the  same  manner  as  if  a  vector-radix  MD  DFT  were  employed. 

Currently  other  applications  are  being  vigorously  explored  to  determine  how  suit¬ 
able  they  are  for  multiprocessor  computation  by  the  method  developed  in  this  research 
program. 
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Implementations 

For  proof  of  concept  a  variant  of  this  algorithm  was  implemented  on  the  AT&T  BTlOO 
[2]  multiprocessor.  The  implementation  was  benchmarked  over  ten  times  faster  then 
a  row-column  2D  DFT  of  the  same  size  on  the  machine.  The  cause  of  the  dramatic 
difference  in  speed  is  that  on  the  BTlOO  commimication  is  much  slower  than  compu¬ 
tation. 


1.2  Background 

Algorithms  that  compute  the  multidimensional  discrete  Fourier  trcuisform  (MD  DFT) 
have  been  roughly  categorized  as  row-column,  multidimensional  variants  of  Cooley- 
Tukey  (MD  CT),  and  reduced  transform  algorithms  (RTA).  Parallel  implementations 
of  the  row-column  and  multidimensional  Cooley-Tukey  algorithms  have  been  widely 
studied  in  the  literature,  a  short  list  of  these  includes  [19,18,37,13,24,6].  The  feature 
of  these  algorithms  most  pertinent  to  this  program  is  that  they  alternate  stages  of 
computation  with  stages  of  global  data  excheinge. 

The  row-column  algorithm  ev^duates  the  N  x  N  2D  DFT  by  computing  the  ID 
iV— point  DFT  of  the  rows  and  columns  of  the  2D  array.  Parallel  implementations  of 
this  algorithm  require  a  global  transpose  operation  between  the  row  and  column  FFT 
stages.  Specific  implementations  of  this  algorithm  differ  in  the  machine  model  and 
method  used  to  perform  the  transpose.  For  a  distributed  machine  model,  the  global 
transpose  operation  requires  every  processing  element  to  exchange  data  with  every 
other  processing  element.  Interconnection  networks  that  can  transpose  an  iV  x  AT  array 
distributed  on  N  processors  in  0{N)  time  include:  the  data  manipulator]  10],  omega[21], 
Staran  flip [5], generalized  cube[30],  emd  ADM[31].  The  data  manipulator  requires  log2N 
stages  of  N  switches,  while  the  others  require  log2N  stages  of  N/2  switches.  These 
systems  are  the  most  costly  and  complex  elements  of  the  multiprocessor  system. 

The  Cooley-Tukey  [9]  algorithm  was  extended  to  multiple  dimensions  in  [27,14,17]. 
A  unified  treatment  of  these  is  given  in  [22].  As  in  the  one  dimensional  czise,  the  additive 
properties  of  the  indexing  set  are  exploited  to  factor  the  MD  DFT  into  alternating 
stages  of  lower  order  MD  DFT,  twiddle  multiplication,  and  data  permutation.  This 
decomposition  process  can  be  applied  recursively  to  produce  algorithms  whose  datafiow 
is  optimized  for  a  specific  architecture  [18],  and  that  have  a  reduced  number  of  multiplies 
relative  to  the  row-column  method.  However,  edl  vairiants  of  the  multidimensional 
Cooley-Tukey  algorithm  require  alternating  stages  of  DFT  with  stages  of  data  exchange. 
Like  the  row-column  method,  extensive  interprocessor  communications  is  required. 

Reduced  transform  algorithms  (RTA)  will  be  considered  eis  alternative  methods 
of  computation.  These  algorithms  apply  number  theory  or  polynomieil  theory  to  the 
computation  of  the  multidimensional  DFT.  They  are  reductions  in  the  sense  that  they 
compute  a  d— dimensional  DFT  with  fewer  ID  DFTs  than  the  dN’^~^  ID  DFTs  re¬ 
quired  by  the  row-column  method.  These  algorithms  include  the  polynomial  transform 
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algorithm  [25,26],  the  multidimensional  AFW  algorithm  [3],  the  weighted  redundancy 
transform  (WRT)  [35,34],  and  the  linear  congruence  algorithm  [11].  Some  RTA  zire 
shown  to  possess  structures  that  do  not  intersperse  computation  stages  with  commu¬ 
nication  stages.  The  algorithms  that  satisfy  this  constraint  can  be  implemented  on 
distributed  processors  with  no  interprocessor  communication.  These  algorithms  have 
neither  the  time  nor  the  hardware  cost  associated  with  the  data  exchange  stages  of 
row-column  and  multidimensional  Cooley-Tukey  2ilgorithms. 

The  remainder  of  this  chapter  is  organized  as  follows.  First,  the  multidimensional 
DFT  is  defined,  and  the  row-column  and  multidimensional  Cooley-Tukey  algorithms 
(MD  CT)  are  reviewed.  The  row-column  algorithm  will  be  used  in  chapters  3  and  4 
as  a  comparative  measure.  The  MD  CT  algorithm  will  be  used  to  develop  the  hybrid 
algorithm  of  chapter  4.  Reduced  transform  algorithms  are  discussed,  and  a  pareimetric 
line  formulation  of  the  linear  congruence. algorithm  is  described. 


MD  DFT  Definition 

The  sununation  form  of  the  multidimensional  discrete  Fourier  transform  is  given  by 


n,— l  nj  — 1  n^—t 

•  •  •  lid)  =  S  H  •  "  E  x(kuk2, . . . , 

fcl=:0  A:2=0  kjsO 


■LJ: 


Jdl^d 


(1.1) 


0  <  ji,  ki  <  Hi,  =  e  . 

Letting  n,  =  N,  the  direct  computation  of  a  single  output  point  by  the  summation 
definition  or  by  matrix  multiplication  requires  complex  multiplies,  and  N'^  complex 
additions.  There  axe  N‘^  output  points,  so  the  evaluation  of  the  entire  expression  by 
this  method  requires  complex  multiplies,  eind  N'^N'^  complex  additions.  As  in 

the  ID  case  fast  2Jgorithms  exist  for  MD  DFT.  The  row-column  algorithm  below  is  one 
of  these. 


Row-Column  Algorithms 

The  row-colunm  algorithm  eveduates  the  n  x  n  2D  DFT  by  computing  the  ID  n— point 
DFT  of  the  rows  and  columns  of  the  2D  array.  The  nxn2D  DFT  summation  is  defined 
eis 

rO.Ji)  =  E  V  Mi;  (1-2) 

k,  =0  *2=0 

Equation  (1.2)  is  seen  to  compute  one  dimensional  n— point  DFT  on  the  rows  then 
columns  of  the  input  array.  The  row-column  method  is  extended  to  any  number  of 
dimensions  by  performing  ID  DFT  with  respect  to  each  dimension. 

There  are  diV**"’  one  dimensional  Fourier  tremsforms  required  to  evaluate  the  d— di¬ 
mensional  Fourier  transform  by  the  row-column  method.  If  they  are  computed  directly 
then  0(dN'^'^^)  arithmetic  operations  are  required.  If  AT  =  2"  and  they  are  computed 
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using  the  FFT,  then  the  complexity  is  0{dN'^  log  N).  This  represents  a  considerable 
improvement  over  the  operations  required  for  direct  evaluation  of  (1.2). 

Multidimensional  Cooley-Tukey  Algorithms 

The  row-column  method  applies  the  Cooley-Tukey  (CT)  FFT  algorithm  to  each  dimen¬ 
sion  separately.  For  the  2D  case  the  CT  FFT  is  applied  row-wise  then  column-wise. 
This  decimates  and  transforms  the  array  in  each  index  separately.  In  [14]  a  derivation  of 
an  algorithm  that  decimates  and  transforms  both  indices  simultaneously  is  given.  This 
Jilgorithm  is  a  multidimensional  Cooley-Tukey  algorithm  in  the  sense  that  it  exploits 
the  additive  properties  of  the  underlying  indexing  set  to  decompose  the  MD  DFT  into 
stages  of  lower  order  DFT,  stages  of  permutation,  and  stages  of  twiddle  multiplication. 
Below,  the  multidimensional  Cooley-Tukey  algorithm  is  shown  for  the  2D  case  and  the 
MD  case. 

The  Ni  X  N2  2D  DFT  is  given  by 

Y(kuk,)  =  E 

n,  =0  n2=:0 

For  Ni  and  N2  composite,  Ni  =  riSi  and  N2  =  r2S2  let 

n\  =  SjPi  -|-  Ui,  n2  =  S2P2  +  ^2 

0  <  Pi  <  r,,  0  <  Ui  <  Si. 

and  let 

ki  =ai+  ribi,  ^2  =  02  +  r2b2 
0  <ai  <  ri,  0  <  6,  <  Si. 

Substituting  (1.5)  and  (1.4  into  (1.3)  gives  the  two-dimensioned  Cooley-Tukey  (vector- 
radix)  decomposition  [14] 


(1.3) 

(1.4) 

(1.5) 


r(a,+r,i.„a,  +  r,6,)  =  (1-®) 

uj=0iij=0 

•  H  xisipi  -I-  Ui,S2P2  +  U2)iJr!^'u;“^^ 

pi=0pj=0 


The  decomposition  of  (1.7)  extends  naturally  to  multidimensional  treinsforms  that 
have  a  composite  number  of  points  in  each  dimension.  Let 


=  -Si^i)  ^i  =  fli  +  ^ibi,  and  rij  =  s^pi  +  Ui 


(1.7) 


where 


0  ^  Qi ,  Pi  /  i ,  0  ^  ^i ,  Ui  Si,  t  —  1 ,  2,  .  .  .  ,  d 


5 


(1.8) 


then  the  d— dimensional  DFT  given  by  (1.1)  can  be  written 


1^(01  +ri6i,. .  .,ad  +  r^bd)  = 

ui=0  Ud=0 


ri-l  r^-X 

•  +  Wl,  •  •  •  ,  SdPd  +  Ud)u}rl’’' 

Pl=0  Pd=0 


■  ■  -UJ 


adPd 

Td 


In  chapter  4  this  algorithm  will  be  coupled  with  the  hyperplane  algorithm  to  pro¬ 
duce  a  multiprocessor  algorithm  that  is  highly  scalable  and  requires  no  interprocessor 
communication. 


1.3  Reduced  Transform  Algorithms 

Reduced  transform  algorithms  (RTA)  take  their  name  from  the  fact  that  they  reduce 
the  number  of  one-dimensional  DFT  needed  to  compute  the  MD  DFT  below  the  dN'^~^ 
required  by  the  row-column  method.  These  algorithms  apply  number  theory  or  poly¬ 
nomial  theory  to  the  computation  of  the  MD  DFT.  They  include  the  Nussbaumer- 
Quandalle  polynomial  transform  algorithm  [25,26],  the  multidimensional  algorithm  [3] 
of  Auslander  et  ai,  the  weighted  redtmdancy  transform  (WRT)  [35,34]  of  Vulis  and 
Tsai,  and  the  linear  congruence  algorithm  of  Gertner  [11]. 

In  [3],  Auslander  et  al.  introduce  an  algorithm  for  the  computation  of  the  multi¬ 
dimensional  DFT.  The  algorithm  depends  on  the  underlying  indexing  set  possessing 
finite  field  properties  and  is  therefore  restricted  to  arrays  of  prime  size  in  eaich  dimen¬ 
sion.  The  WRT  edgorithm  [35,34]  depends  on  ring  properties  of  the  indexing  set,  and 
operates  on  any  M  x  N  array.  This  edgorithm  computes  the  MD  DFT  in  terms  of  a 
set  of  lines  that  cover  the  output  array.  The  output  lines  are  formed  by  summation 
from  a  set  of  ID  DFT  computed  on  lines  covering  the  input  array.  If  the  computation 
of  the  ID  DFT  are  performed  in  parallel,  then  interprocess  communication  must  occur 
to  form  the  output  lines. 

The  Nussbaumer-Quandalle  algorithm  [26,25]  uses  polynomial  tremsforms  to  map 
the  multidimensional  DFT  into  a  set  of  independent  ID  DFT,  and  the  linesir  congru¬ 
ence  algorithm  of  [11]  computes  the  2D  DFT  by  a  set  of  independent  ID  DFT  along 
linear  congruences  over  the  indexing  set.  Both  sdgorithms  operate  on  2D  square  ar¬ 
rays  of  prime,  power  of  prime,  and  product  of  prime  sizes  in  each  dimension.  Both 
of  these  algorithms  evaluate  the  2D  DFT  by  a  set  of  independent  ID  DFT  computed 
on  data  derived  from  the  input  set  using  only  additions.  Parallel  algorithms  like  those 
considered  in  this  program  may  be  developed  using  these  reduction  algorithms. 

The  remainder  of  this  section  gives  a  pareunetric  line  formulation  of  the  lineeir  con¬ 
gruence  algorithm  of  [11].  This  formulation  of  the  2D  DFT  directly  restricts  com¬ 
putation  to  a  set  of  lines  covering  the  output  grid.  Correspondence  between  the 
Nussbaumer-Quandalle  algorithm  and  the  line  algorithm  are  given  in  [1,28|. 
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The  Parametric  Form  of  the  Linear  Congruence  Algorithm 

The  restriction  of  the  2D  DFT  to  a  set  of  lines  covering  the  output  grid  is  described 
using  the  parametric  form  of  the  line  (1).  Only  the  prime  case  of  the  algorithm  is 
presented  here.  This  formulation  is  computationally  identical  to  the  linear  congruence 
jdgorithm  in  [11]. 

The  P  X  P2D  DFT  will  be  computed  by  restricting  it  to  a  set  of  lines  over  the 
indexing  set  ZjPx  ZfP.  In  general,  the  line  through  a  point  a  =  (01,02)  G  Z/Px  ZjP 
is  defined  as  the  (cyclic)  subgroup  generated  by  o 


Ii(a)  =  £((01 , 02 ))  =  {to  =  (oit,  02^)  ;t  =  0, 1,...,P— 1}.  (1-9) 

The  following  theorem  gives  a  set  of  P  point  radizd  lines  that  cover  ZjP  x  ZjP  [Ij. 

Theorem  1.3.1  The  set  o/P  +  1  lines 

£((m,l)),  m  =  0,l,...,p- 1 

I((1,0)) 


cover  Z/P  X  ZfP. 

The  P  X  P  2D  DFT  can  be  computed  by  restricting  it  to  lines  covering  the  P  x  P 
output  grid.  The  points  on  the  lines  £((m,  1))  and  £((  1, 0))  are  given  by  the  preaddition 
operations  (derivation  in  chapter  2) 

=  Y^x{i,d-mi)  (1-10) 

i=0 

4''°'  =  (1-11) 

1=0 

and  the  one  dimensional  P— point  DFT 


p-\ 


/  .  1 

d=0 

P-l 

m  =  0, 1, . 

..,P-1 

(1.12) 

K(/,0)  = 

d=0 

t  =  0,1,... 

.,P-1 

(1.13) 

Equations  (1.12)  and  (1.13)  evaluate  the  P  x  P  2D  Fourier  transform  with  P  +  1  one 
dimensioned  Fourier  transforms  on  data  derived  from  the  input  by  the  preadditions  of 
(1.10)  and  (1.11). 

In  chapter  2  this  algorithm  is  generalized  to  permit  the  computation  of  multidimen¬ 
sional  discrete  Fourier  transform  restricted  to  hyperplanes.  The  generedized  algorithm 
is  used  in  the  design  of  multiprocessor  algorithms  in  chapters  3  and  4. 
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1.4  Report  Organization 

This  report  is  organized  as  follows:  Algorithms  for  MD  DFT  computation  were  re¬ 
viewed  in  1.2.  The  hyperplane  algorithm  required  by  this  program  is  developed  in 
chapter  2.  A  direct  mapping  of  this  algorithm  to  the  broadcast  mode  multiprocessor  is 
given  in  chapter  3.  The  resulting  multiprocessor  algorithm  requires  no  interprocessor 
commimication,  but  does  not  scale  flexibly.  A  hybrid  algorithm  that  marries  MD  CT 
methods  with  hyperplzuie  methods  is  given  in  chapter  4.  Like  the  direct  mapping  algo¬ 
rithm,  the  hybrid  algorithm  requires  no  interprocessor  communications.  It  also  hats  the 
further  benefit  of  being  highly  scalable.  The  results  of  this  program  are  summarized  in 
the  fined  chapter. 
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Chapter  2 


An  Algorithm  for  the  Computation  of  the  MD  DFT  on  Hyperplanes 


Introduction 

The  multiprocessor  algorithms  developed  in  this  program  depend  on  the  reduced  trans¬ 
form  algorithm  (RTA)  presented  in  this  chapter.  The  algorithm  computes  a  d— dimen¬ 
sional  DFT  by  a  set  of  independent  fc— dimensional  DFT;  it  is  a  reduction  zdgorithm  in 
the  sense  that  it  has  lowered  the  dimension  of  the  Fourier  tramsforms  being  computed. 
The  fc— dimensional  DFT  are  performed  on  data  derived  from  the  input  data  using  only 
additions,  and  produce  A:— dimensional  hyperplanes  of  the  output. 

The  algorithm  is  derived  by  restricting  the  d— dimensional  DFT  to  a  collection  of 
subgroups  of  its  output  indexing  set.  In  order  to  describe  these  subgroups  in  a  natural 
manner,  the  notion  of  the  discrete  fc— dimensional  hyperplane  is  employed.  In  terms  of 
these  hyperplanes  the  development  of  the  algorithm  is  undertaiken  in  two  parts.  The 
first  part  defines  a  minimal  set  of  fc— dimensional  hyperplanes  that  cover  the  d— di¬ 
mensional  array.  The  second  part  restricts  the  d— dimensional  DFT  to  each  of  the 
fc— dimensional  hyperplanes  of  the  covering  set.  The  restrictions  are  shown  to  evaluate 
as  independent  fc— dimensional  DFT. 

The  algorithm  is  computationally  similar  to  the  reduced  treinsform  algorithms  of 
Auslander  etai  [3],  Nussbaumer  and  Quandeille  [26,25],  and  Gertner  and  Tolimieri 
[11,12].  Like  those  algorithms  it  computes  the  midtidimensioned  DFT  by  a  preaddition 
stage  followed  by  a  stage  of  independent  DFT.  The  approach  presented  here  computes 
the  MD  DFT  on  A:— dimensional  hyperplanes  of  the  output  array.  It  is  a  gener2dization 
and  extension  of  the  linear  congruence  algorithm  of  [11,12].  A  correspondence  between 
the  line  and  polynomial  algorithms  is  given  in  [1,28]. 

The  motivation  for  developing  this  algorithm  is  the  generation  of  a  parallel  sJgorithm 
for  MD  DFT  computation  that  permits  flexible  scaling  to  the  degree  of  parallelism 
of  the  target  architecture.  In  later  work  a  broadcast  mode  multiprocessor  algorithm 
based  on  the  algorithm  presented  here  will  be  demonstrated.  The  main  features  of 
that  algorithm  are  that  it  (1)  requires  no  interprocessor  commimication,  and  (2)  that 
it  scales  to  match  the  degree  of  parallelism  of  the  target  processor. 

The  remainder  of  the  presentation  is  organized  as  follows:  first  the  definitions  per¬ 
tinent  to  the  generation  of  A:— dimensional  hyperplanes  are  stated,  then  a  minimal  set 
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of  covering  fc— dimensional  hyperplanes  is  derived  for  the  cases:  where  the  array  is  of 
equal  and  prime  size  in  at  least  d  —  k  +  I  dimensions,  and  where  the  array  is  of  equal 
and  power  of  prime  (including  2)  size  in  at  least  d—  k  +  1  dimensions.  For  each  case  the 
restriction  of  the  d— dimensional  DFT  to  the  covering  set  of  A:— dimensional  hyperplanes 
is  given.  Both  cases  are  shown  to  reduce  the  computation  to  a  set  of  independent  DFT 
performed  on  data  derived  from  the  input  data  using  only  additions.  An  appendix  is 
provided  that  enumerates  3D  and  4D  ca^es  of  the  algorithm. 

2.1  Definitions 

This  section  presents  definitions  pertinent  to  the  derivation  of  the  algorithm.  The  algo¬ 
rithm  restricts  the  computation  of  the  d— dimensional  DFT  to  a  collection  of  subgroups 
of  the  output  array.  These  subgroups  cire  defined  on  the  underlying  indexing  set  of  the 
d— dimensional  DFT. 

The  index  set  of  the  Ni  x  ■  •  ■  x  Nd  d— dimensional  DFT  is  associated  with  the  ring 


C  =  Z/Ni  X  •  •  •  X  Z/Nd, 


where  Z/Ni  is  the  ring  of  integers  modulo  iV,-.  This  set  defines  the  region  of  support  of 
the  function.  A  point  in  the  index  set  is  taken  to  be  the  vector  u  €  C  defined  by  the 
d— tuple 

u  =  («i,...,Urf),  me  Z/Ni. 

The  value  of  the  DFT  restricted  to  a  point  «  in  the  output  array  is  evaluated  in  the 
natural  way  as 


iVi-l  Na-l 

j:  •••  Y.  . 

ai=0  ad=0 

for  a  fixed  ueC. 

The  discrete  line  through  a  point  in  a  two-dimensional  array  a  =  (01,02)  G  Z/N  x 
Z/N  \s  defined  as  the  (cyclic)  subgroup  additively  generated  by  a 

L{a)  =  {so  =  (501,502) :  5  G  Z/N}.  (2-1) 

The  line  is  said  to  be  of  full  order  if  |I'(a)l  =  N. 

This  definition  of  a  line  extends  naturally  from  the  case  where  the  line  lies  in  a 
two-dimensional  grid  to  the  case  where  the  line  lies  in  a  d— dimensional  array.  The  line 
through  a  point  a  G  {Z/NY  is  defined  by  the  subgroup  [12] 


L{a)  =  {50  =  (501, . . .  ,50d)  :  5  G  Z/N).  (2.2) 

By  definition  the  line  L(a)  is  additively  generated  by  o;  L{{a))  contains  all  linear 
combinations  of  the  point  o  G  {Z/NY-  Mu)  defines  the  additive  closure  of  o  G  {Z/NY- 
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In  a  majiner  similar  to  the  way  L(a)  defines  the  closure  of  a  point  a  G  {Z/NY, 
we  may  define  the  closure  of  a  set  of  points  £  {Z/NY-  The  closure  of 

Oi, . . . ,  €  (ZfNY  given  by  the  subgroup 

H{ai,. . .  ,ajt)  =  {•si«i  H - H  SkOje  ■  ^  ^/^}-  (2-3) 


H{ai,  ■  ■ .  ,ajt)  contains  all  linear  combinations  of  the  points  in  the  set  {aj, . . . 

The  idea  of  dimension  can  be  associated  with  the  subgroups  generated  in  this  way. 
The  dimension  of  a  subgroup  of  {ZfNY  is  taken  to  be  the  number  of  linearly  inde¬ 
pendent  points  G  {Z/NY  required  to  generate  it  by  the  definition  of  (2.3).  In  this 
manner,  subgroups  generated  by  the  additive  closure  of  1,  2,  and  3  linearly  independent 
points  will  be  called  lines,  planes,  and  cubes  respectively.  Similarly,  a  subgroup  whose 
dimension  is  d  —  1  is  called  a  hyperplzme.  If  the  dimension  of  a  subgroup  is  A:  <  d  it  is 
called  a  fc— dimensional  hyperplane. 

The  requirement  of  (2.3)  that  the  array  be  of  equal  size  in  each  dimension  is  overly 
restrictive  for  the  development  of  the  algorithm.  It  will  be  seen  that  it  is  only  necessary 
for  the  array  to  be  of  equal  size  in  d  —  A:  -f- 1  dimensions. 

Consider  a  d— dimensional  array  whose  region  of  support  is  given  hy  C  =  ZfNi  x 
•  •  •  X  ZfNii.  Assume  that  d  —  A:  -f-  1  dimensions  are  of  equal  size,  and  for  simplicity 
of  presentation  allow  those  dimensions  to  be  contiguous.  Write  Nk  =  •  •  •  =  Nd-  The 
definition  adopted  for  the  presentation  of  the  algorithm  uses  the  subgroup  ifjfe(a)  formed 
by  the  closure  of  the  vector  a  G  iZ  with  the  A:  —  1  standard  basis  vectors 


^.0)  = 


{ 


1, 

0, 


if  i  =  j 

otherwise  ’ 


i  =  1, . . . ,  A:  —  1. 


(2.4) 


The  subgroup 


Hkio)  =  +  SkQji :  Si  G  Z/Ni}  (2.5) 


will  define  the  A:— dimensional  discrete  hyperpleme.  For  the  purpose  of  the  development 
of  the  algorithm  we  will  only  be  concerned  with  A:— dimensional  hyperplanes  of  full 
order.  That  is  (iVj  •  •  •  Nk)  points.  As  an  example,  consider  the  4D  array  (Z/N)*.  The 
hyperplane  generated  by  the  point  a  =  (0,0, 1, 1)  is  given  by 

H3((0,0,  1,1))  =  {(si,S2,S3,S3)  :  5,  G  Z/N}, 

and  ff3((0,0, 1, 1))  contains  points.  Similarly,  the  hyperplane  generated  by  the 
point  a  =  (0,0,i,y)  is  given  by 

^3((0,0,x,y))  =  {(si,S2,53X,S3y)  :  Si  G  Z/N}. 

A  set  of  A:— dimensional  hyperplanes  is  said  to  cover  the  d— dimensional  array  C, 
if  every  point  in  C  is  in  at  least  one  A:— dimensional  hyperplane  of  the  set.  A  set  of 
covering  A:— dimensional  hyperplanes  is  minimal  if  there  is  no  smaller  set  of  A:— dimen¬ 
sional  hyperplanes  that  also  covers  C. 
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2.2  Prime  Case 


This  section  derives  the  prime  case  of  the  algorithm.  For  this  case  the  d— dimensional 

DFT  is  defined  over  the  region  C  =  ZjNx  x  •  •  ■  x  Z/Nj  where  C  is  of  equal  and  prime 

size  in  d  —  +  1  dimensions.  That  is, 

C  =  Z/Nx  X  •  •  •  X  2/iVj  ^AxB, 

A  =  Z/Nx  X  •  •  •  X  Z/Nk-x  X  ZIP,  (2.6) 

B  =  (ZIPY~^ ,  and  P  is  a  prime. 

The  algorithm  is  derived  in  two  parts.  The  first  part  is  the  specification  of  a  minimal 
set  of  fc— dimensional  hyperplanes  that  cover  C.  This  follows  immediately  below.  The 
second  part  of  the  derivation  requires  the  d— dimensional  DFT  to  be  restricted  to  each 
of  the  fc— dimensional  hyperplanes  of  the  covering  set.  The  resulting  algorithm  is  shown 
to  be  computed  by  a  set  of  independent  A:— dimensional  DFT.  This  Ccise  of  the  edgorithm 
is  specified  by  the  procedure  of  equations  (2.20)  and  (2.21). 

Covering  Hyperplanes 

The  following  theorem  gives  a  minimal  set  of  |>41  point  A:— dimensional  hyperplanes  that 
cover  R.  Let  ^(j)  denote  the  vector  of  j  zeros,  ie.  ^(3)  =  (0,0,0).  The  A:— dimensional 
hyperplanes  are  defined  by  equation  (2.5). 

Theorem  2.2.1  The  set  of  k~ dimensional  hyperplanes 

Hq  =  Hk{(i(k-\)A,mi,...,mci-k))  mi  6  Z/P 

Hx  =  /ffc((^(fe-i),0,l,m2,...,md_jt)) 

n^-k  =  Hk{(^^k-x)M...,0,l)) 

covers  the  d— dimensional  array  C  =  A  x  B,  where  A  =  Z/Nx  x  •  ■  ■  x  Z/Nk-x  x  Z/P, 

B  =  (Z/P)‘^~'’,  and  P  is  a  prime.  There  are 

pd-k+l  _  2 

P-1 

k— dimensional  hyperplanes  in  the  set. 

Proof.  Every  a  =  (oi, .  •  • ,  oj)  €  C  can  be  written 

0.  =  0(1)  H - +  ^ife-i)  + 

where  is  the  vector  whose  d*  component  is  a,  and  is  zero  elsewhere,  and  a'  is  the 
vector  a'  =  (0(fc_i),  a*, . . . ,  Cd). 

By  definition  of  Hkia),  is  in  every  hyperplane  of  the  covering  set  for  i  =  1, . . . ,  k— 
1.  Since  the  hyperplanes  are  closed  under  addition,  covering  is  shown  if  every  a'  is  in 
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at  least  one  hyperplane  of  the  covering  set.  If  ajt  ^  0  then  a/t  is  invertible  modulo  P 
and  a'  may  be  rewritten 


and  a'  is  seen  to  be  contained  in  the  union  of  the  hyperplanes  of  the  set  Ho  of  the 
theorem.  By  closure,  the  point  a  =  a*  +  •  •  •  +  +  g'  must  also  be  contained  in  the 

union  of  the  hyperplanes  of  the  set  Hq.  There  are  fc— dimensional  hyperplanes  in 

Ho. 

Repetition  of  this  step  until  all  the  remaining  d  —  k  components  of  a'  are  exhausted 
completes  the  proof.  The  a'*  set  of  hyperplanes  generated  contains  p<^-*-«  ^—dimen¬ 
sioned  hyperplanes.  In  all  there  are 


pd-k+\  _  j 

P  -  1 

fc— dimensional  hyperplanes  in  the  covering  set. 

□ 

The  A:— dimensional  hyperplanes  of  the  covering  set  of  theorem  2.2.1  contain  redim- 
dant  points.  The  number  of  points  in  C  is 


Id  =  ATf.iVfc.t  .P‘'-''+‘. 

The  number  of  points  in  each  dimensional  hyperplane  is 

M|  =  ivr,..-iVfc_,.p, 


and  the  number  of  redundant  points  is 


The  locations  of  the  redundant  points  are  given  by  the  (fc  —  1)  dimensional  hyperplane 


^(^1, . . .  =  {(5,, . . . , Sk-x^eid -  k +  1)):  Si  e  Z/Ni}  (2.7) 

where  H{6i, . . .  ,^_i)  is  defined  by  equation  (2.3).  The  elements  of  (2.7)  are  common 
to  every  hyperplane  of  theorem  2.2.1,  and  are  redimdant 


P- 


times,  accounting  for  all  redundant  points. 
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DFT  Restriction 

The  d— dimensional  DFT  can  be  computed  by  restricting  it  to  the  ^-—dimensional  hy¬ 
perplanes  of  theorem  2.2.1.  Below,  the  restriction  of  the  MD  DFT  to  those  hyperplanes 
is  derived. 

The  d— dimensional  DFT  over  C  =  Z/Ni  x  Z/Nj  is  given  by 

F(ti)  =  5:  a  e  C.  (2.8) 

a  ec 

The  prime  case  of  the  algorithm  requires  d  —  k  +  1  dimensions  to  be  of  equaJ  and  prime 
size.  For  simplicity  of  presentation  assmne  these  are  contiguous  and  write 

C  =  AxB,  ■ 

A  =  Z/Nx  X  •  •  •  X  Z/Nk-x  X  P, 

B  —  {ZIPY~'‘.,  and  P  is  a  prime. 

The  P*^"*  A:— dimensional  hyperplanes  of  Tio  given  by  theorem  2.2.1  can  be  defined  by 
the  homomorphism 

(2.9) 

For  all  1  €  >1,  ^0  is  given  by 

^o(s)  =  (2.10) 

The  restriction  of  the  d— dimensional  DFT  (2.8)  to  the  fc— dimensional  hyperplanes  of 
Tio  is  evaluated  by  replacing  u  with  $o(s)  of  (2.10)  to  obtain 

^(^o(s))=  (2-11) 

oec 

A. 

The  inner  product 


^o(s)  •  a  =  SxQx  s*_iafc_i  -f  sic{ak  +  Ok+i^i  4-  •  •  •  +  cid^d-k)  (2-12) 

can  be  written 

^o(l)  •  a  =  sidi -1 - l-5jfcdfc  =  s-d  (2.13) 

d€A 

where 

dj  =  Qj ,  j  —  1,...,Aj  1 

ij.k=ah  i  =  fc -h  1,...  ,d  (2.14) 

dfc  =  a* -I- miCfc+i -f  •  •  • -h  mj.jtcj  . 
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Substituting  (2.13)  in  (2.11)  allows  the  d— dimensional  DFT  restricted  to  the  fc— di- 
mensionzd  hyperplanes  of  Ho,  to  be  written 

V($o(s))  =  E  (2-15) 

d(iA  i€8 


Where 


deA. 

Ok  =  dk  -  m-  i, 


(2.16) 


and 

d(k-c)  =  {(^\^---‘>(ik-c)-  (2-17) 

Equation  (2.15)  can  be  rewritten  as  a  reduction  stage,  followed  by  a  DFT  stage.  The 
reduction  operation  is  given  by 

^  X  {d^k-l),  dfc  -  m  •  i,  i)  ,  deA  (2-18) 

i€8 


and  the  DFT  stage  is  given  by 

V^(^o(l))  =  E  (2.19) 

deA 

5  G  >1,  meB. 

Equations  (2-.18)  and  (2.19)  give  the  values  of  the  d— dimensioned  DFT  on  the  A— di¬ 
mensional  hyperplanes  of  the  set  Hq.  There  is  one  such  hyperplane  for  every  meB. 

In  a  similar  manner,  the  restriction  of  the  d— dimensional  DFT  to  each  of  the  re¬ 
maining  ib— dimensional  hyperplanes  of  the  covering  set  of  theorem  2.2.1  is  computed 
by  a  reduction  operation  followed  by  a  A:— dimensionad  DFT  of  those  points. 

The  complete  algorithm  is  stated  below  for  the  case  where  the  d— dimensional  DFT 
over  C  is  of  prime  size  in  at  least  d  —  A:  -|- 1  dimensions.  Then  C  is  given  by 

C  =  AxB, 

A  =  Ni  X  ■  ■  ■  X  Nk-i  X  P, 

B  =  {Z/PY~'‘,  and  P  is  a  prime. 

Step  1  Reduction  stage.  This  stage  requires  the  evaluation  of  the  summations 

d-k 

aj®  =  53  x(d(jt_,),  dfc  -  i),  (2.20) 

leo  '=i 


d-k 

53  ^j)’  ~  ^3  ^j+li  •  •  ■  Ad-k')‘i 

i€B  l=j+l 


=  E  ^idik-i),  i,  dk), 

les 
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d£  A,  mj  6  ZjNi 


Step  2  DFT  stage.  This  stage  requires  the  evaluation  of  the  A;— dimensional  DFT 


ViMs)) 

_  n^O,  ,Sldl 

-  2^  ^d  ^Ni  • 

,  ,3k<ik 

deA 

ViHi.)) 

—  Z^^d  ‘^Ni 

,  ,»k<ik 
^Nk  ’ 

V{^d-k{s)) 

—  Z^^d  ^Ni 

,  .>k<ik 
^Nk 

deA 

se  A. 


In  general  is  given  by 

^j(§.)  —  (-Sl  1  •  •  •  t  ^k—l  1  ^(.7  )>  ^kt  ^k^^j-i-l  >  •  •  •  )  it)> 


(2.21) 


mj  6  ZfNi. 


The  d— dimensional  DFT  over  C  =  Ax  B  is  seen  to  be  computed  by 

pd-k-ki  _  I 

P-1 

independent  A:— dimensional  DFT  over  A.  The  A:— dimension^ll  DFT  are  eveduated  on 
data  derived  from  the  input  data  using  only  additions. 

2.3  Power  of  Prime  Case 

This  section  derives  the  power  of  prime  case  of  the  algorithm.  Like  the  prime  case, 
the  power  of  prime  case  is  developed  in  two  parts.  The  first  pau:t  is  the  specification  of 
a  minimal  set  of  covering  A:— dimensional  hyper  planes.  The  second  part  is  the  restric¬ 
tion  of  the  d— dimensional  DFT  to  each  of  the  hyperplanes  of  the  covering  set.  The 
derivation  of  the  algorithm  follows  below.  The  algorithm  is  stated  by  the  procedure  of 
equations  (2.34)  and  (2.35). 

Covering  Hyperplanes 

Consider  the  d— dimensional  DFT  defined  over 

C  =  Z/Ni  X  •  •  •  X  Z/Nd 
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where  C  is  of  equal  and  power  of  prime  size  in  at  least  d  —  A'  -f-  1  dimensions.  For  ease 
of  presentation  these  dimensions  are  taken  to  be  contiguous.  That  is, 

C  =  Ax  B, 

A  =  Z/Nx  X  •  ■  •  X  Z/Nk-i  X  Z/P-,  (2.22) 

B  =  {Z / ,  and  P  is  any  prime  including  2. 

The  following  theorem  gives  a  minimal  set  of  |>ll  point  A— dimensional  hyperplanes  that 
cover  C.  Let  denote  the  vector  of  j  zeros,  ie.  ^3)  =  (0,0,0).  The  A— dimensional 
hyperplanes  are  defined  by  equation  (2.5). 

Theorem  2.3.1  The  set  of  k— dimensional  hyperplanes 

Hq  =  Affc((^(fc-i),l,m,,...,md_fc)),  m,  6  Z/P" 

Hx  =  P*;((^(fc-i),riP,l,m2,...,mj_fc)),  rieZ/P''-'^ 

Hd-k  =  Hk{{Q.{k-i),Pri,Pr2,....,Prd-kA)) 

covers  the  d—  dimensional  array  C  =  A  x  B,  where  A  =  Z/Ni  x  ■  •  ■  x  ZfNk-i  x  ZjP'^, 
B  =  {Z ! P^Y~'‘ ,  and  P  is  any  prime  including  2.  There  are 

,  pn\<i-k  f  pd-k+l  _ 

Vp";  V  p-i  j 

k  —  dimensional  hyperplanes  in  the  set. 

Proof.  Similar  to  theorem  2.2.1. 

□ 

The  covering  set  of  theorem  2.3.1  contains  redundant  points.  Altogether  there  are 


such  points.  The  following  theorems  fully  describe  the  redundeincy  between  £iny  two 
hyperplanes  of  the  covering  set  of  theorem  2.3.1.  Throughout  the  sequel,  the  notation 
Xmody  is  taken  to  be 

—  mody  (^1  mod  y  ’  •  •  •  »  ^”mod  y  ) 

where  x  is  the  vector  x  =  (xi . . . ,  x„)  2ind  y  is  a  scaleur. 

Theorem  2.3.2  The  redundancy  between  any  two  k— dimensional  hyperplanes  of  the 
set  'Ho  of  theorem  2.S.1  is  given  by 

Hk{ii(k-i),  Lm))  n  Hk{{i(k-i),  1,;))  = 

/ffc((^(fc-,),P'‘-“,(m„,„dP-)^"'“)),  €  ZjP^. 

Where  a  is  the  maximum  such  that 

rrii  =  ji  mod  P" 

holds  for  all  i  =  1., ...  ,d  —  k.  The  intersection  contains  {Nx  •  ■  ■  Nk-x)P°'  points. 
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Theorem  2.3.3  The  redundancy  between  any  two  k— dimensional  hyperplanes  of  the 
set  'H(,  I  =  1, . . .  ,d  —  k  —  I,  of  theorem  2.3.1  is  given  by 

1,  m.))  n  1,  ;))  = 

where 

r^,q^  e  ZIP"--^,  i  =  l,...,l 
mi,j,  e  ZIP'',  i  =  l-\-l,...,d- k 

and  a  is  the  maximum  such  that 


ri  —  qi  mod  P“ 


and 


TTih,  =  jh  mod  P" 


hold  for  alii  I, . . .  ,l  and  h  =  /+1, . . .  ,d—k.  The  intersection  contains  {N\  •  •  ■  Nk-i)P°' 
points. 


Theorem  2.3.4  The  redundancy  between  any  two  k— dimensional  hyperplanes  of  the 
set  Tid-k  of  theorem  2.3.1  is  given  by 

Pr,  D)  n  Pi,  1))  = 


where 

ri,qi  €  ZfP'^~^ 

and  a  is  the  maximum  such  that 


r,  =  qi  mod  P“ 

holds  for  all  i  =  1, . . .  ,d  ~  k.  The  intersection  contains  {N\  ■  ■  •  Nk-i)P°‘  points. 

Theorem  2.3.5  The  redundancy  between  any  two  k— dimensional  hyperplanes  of  the 
sets  Hj;  and  Tiy,  where  x  ^  y  is  given  by 

Hk{{.B(k-x),r,  1,  m))  n  1,  ;))  =  Hk{ix,..  •  ,^-l,^(<l-fc+l))• 

TAe  intersection  contains  {Nt  •  ■  ■  Nk-] )  points. 
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DFT  Restriction 

In  a  manner  similar  to  the  prime  case  described  in  the  previous  section,  the  d— dimen¬ 
sional  DFT  can  be  computed  by  restricting  it  to  the  covering  set  of  A:— dimensional 
hyperplanes  given  by  theorem  2.3.1.  This  restriction  is  detailed  below. 

The  power  of  prime  case  of  the  algorithm  requires  d  —  k  +  I  dimensions  to  be  of 
equal  and  power  of  prime  size  (including  2).  For  simplicity  of  presentation  these  are 
assumed  to  be  contiguous,  and  we  write 

C  =  AyB, 

A  =  Z/Ni  X  •  •  •  X  Z/Nk-i  X  Z/P", 

B  =  {Z / ,  and  P  is  any  prime  (including  2). 

The  restriction  of  the  DFT  to  the  A:— dimensional  hyperplanes  of  the  set  Tio 

is  identic8il  to  that  of  the  prime  case  derived  in  the  previous  section.  To  illustrate  the 
derivation  of  the  other  terms,  the  set  H\  of  theorem  2.3.1  is  considered  below. 

The  (1/P)(P")‘^“*^  fc— dimensional  hyperplemes  of  Hi  of  theorem  2.3.1  can  be  defined 
by  the  homomorphism 

QiiA^C.  (2.23) 

For  all  s  G  fli  is  given  by 

=  {si,...,Sk-uSkPri,Sk,Skm2,....,Skmd.k)-  (2.24) 

The  restriction  of  the  d— dimensional  DFT  to  the  fc— dimensional  hyperplanes  of  Tdi  is 
evaluated  by  replacing  u  with  f2i(s)  of  (2.24)  to  obtain 

(2-25) 

oec 


The  inner  product 


s  ^  A. 


gives 


where 


^^i(5)‘0  =  siOi -|-  —  -|- 

+-s*(afcriP  +  Qk+i  -I-  ak+2Tn2  H - 1-  a^mj-fc) 


(2.26) 


(^)  •  a  =  -sidi  -| - -I-  Skdk  =  s-  d 

deA 


(2.27) 


dj  —  Qj,  j  —  1 , . . . ,  1 

*1  =  0*1 

ij.k=aj,  j  =  k  +  2,...,d 

dk  =  OjtriP  +  ajt+i  -I-  ak+2Jn2  H - 1-  m^-kad  . 


(2.28) 
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Equation  (2.25)  may  then  be  written 

HeAi&B 


(2.29) 


d-k 


•^(^(fc— 1)'  ^  '  m/l/,  l2i  ...  1  ^d—k\ 

1=2 

de  A. 


Where 


and 


d-k 


afc+i  =  dk-  iiTiP  -  ^  m/2/, 
1=2 

d(k-c)  =  (di,...  ,dk-c). 


(2.30) 


(2.31) 


Equation  (2.29)  can  be  rewritten  as  a  reduction  stage,  followed  by  a  DFT  stage.  The 
reduction  operation  is  given  by 


Oj*  =  Y^x  2i,  dk  -  i\riP  ~  ^t/m/,  i2,...,id-k  )  ,  de  A 

i€S  V  1=2  / 


(2.32) 


and  the  DFT  stage  is  given  by 

Vms))  =  •••<"*.  (2-33) 

d<£A 

s  G  -4,  m^B. 

Equations  (2.32)  and  (2.33)  give  the  values  of  the  d— dimensional  DFT  on  the  di¬ 
mensional  hyperplanes  of  the  set  Tii . 

In  a  similar  manner,  the  restriction  of  the  d— dimensional  DFT  to  each  of  the  re¬ 
maining  fc— dimensional  hyperplanes  of  the  covering  set  of  theorem  2.3.1  is  computed 
by  a  reduction  operation  followed  by  a  A:— dimensional  DFT  of  those  points. 

The  complete  algorithm  is  stated  below  for  the  case  where  the  region  of  support, 
C,  is  of  power  of  prime  size  in  at  least  d  —  A:  -I- 1  dimensions.  Let  C  be  given  by 


C  =  AxB, 

A  =  Ni  X  ■  ■  ■  X  Nk-i  X  P", 

B  =  {ZfP^Y~^,  zmd  P  is  a  prime  (including  2). 

For  this  case,  the  algorithm  is  specified  by  the  following  procedure; 

Step  1  Reduction  stage.  This  stage  requires  the  evaluation  of  the  summations 

/  d-k 

a?*  =  E (^(*-1)’  i 

i€0  V  i=i 


(2.34) 
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(J  d-k  \ 

d{k~\),  i(j),  dk  -'^itriP  -  , 

/=i  i=j+i  ) 

aj""*  =  Y  ^  (d-ik-t]^  h  dk-Y^i^‘P 

ieB  \  (=1 

Step  2  DFT  stage.  This  stage  requires  the  evaluation  of  the  Ar— dimensional  DFT 


I’Cflois))  = 

deA 

(2.35) 

v'(D,u))  = 

d^A 

.  .  . 

s€  A. 

The  d— dimensional  DFT  over  C  =  v4  x  B  is  seen  to  be  computed  by 

/  pn\  pd-k+l  _  j 

It  j  p- 1 

independent  A:— dimensional  DFT  over  A.  The  A:— dimensional  DFT  eire  evaluated  on 
data  derived  from  the  input  data  using  only  additions. 
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2.4  Appendix 

This  section  enumerates  the  3D  and  4D  cases  of  the  algorithm. 


•  3D  — >  2D,  Prime  Case. 

N  X  P  X  P  3D  DFT  computed  by  X  x  P  2D  DFT, 
where  P  is  prime  and  N  is  any  number. 

1.  Covering  planes 

Ho  = //2((0. 1,  m)),  m  =  0, 1, . . . ,  P  —  1 
H,  =//2((0,0.1)). 


2.  Reduction  stage 

P-i 

^  ^2  -  im,  i),  m  =  0, 1, . . .  ,P  -  1 

1=0 

x(d,,  i,  dj). 

t=0 

=  o,i,...,iV  - 1 

d2  =  0,1,...,P-  1 

3.  N  xP  2D  DFT  stage 

J,=0  d2=0 

F(si,  0,  S2)  = 

«i,=0<lj=0 


s,  =0,l,...,.v-l 

S2  =0,1 - P-1 


m  =  0, 1, . . . ,  P  —  1. 


The  computation  of  the  iV  x  P  x  P  3D  DFT  by  this  method  requires  P  +  1  reduction 
operations  zmd  a  like  number  of  iV  x  P  2D  DFT.  Note  that  [33]  is  a  source  of  0(P  log  P) 
prime  size  FFT  algorithms. 

To  compute  a  3D  DFT  by  the  line  approach  each  dimension  would  have  to  be  of 
equzd  size.  The  corresponding  case  is  a  P  x  P  x  P  3D  DFT  computed  as  P*  +  P  +  1 
reduction  operations  and  a  like  number  of  P  point  ID  DFT. 
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•  3D  — >  2D  ,  Power  of  Prime  Case. 

iV  X  P”  X  P”  3D  DFT  computed  by  JV  x  P"  2D  DFT, 
where  P  is  prime  (including  2)  and  N  is  any  number. 


1.  Covering  planes 


'Ho  =  H2{{0,l,m)),  m  =  0, 1,...,P"  -  1 
Hi  =  p2((0,rP,l)),  r  =  0,l,...,P’‘-i  -1. 


2.  Reduction  stage 


a 


no 


a 


Hi 

(dudi) 


P"-l 

x(di,  d2-  irn,  i), 

1  =  0 


pn_l 

Y  ■^(di,  J,  dj  -  rPi), 
1  =  0 


m  =  0, 1, . . . ,  P"  —  1 
r  =  0,l,...,P"-‘-l 


d,  =0,l,...,iV  -  1 
d2  =0,1,...,P'‘  -  1 


3.  iV  X  P"  2D  DFT  stage 


F(si,  sj,  sjm)  =  ff 

(<,=0  rf2=0 

V{SurPs2,S2)  =  EE 

(j|=0  (^3=0 


51  =  0, 1,. . .  ,iV  -  1  m  =  0, 1, . . .  ,P"  —  1, 

52  =  0, 1, . . . ,  P"  -  1  r  =  0, 1, .  •  • ,  P’'"*  -  1. 


The  computation  of  the  iV  x  P”  x  P"  3D  DFT  by  this  method  requires  P"  +  P"“‘ 
reduction  operations  and  a  like  number  of  2D  iV  x  P"  DFT. 

To  compute  a  3D  DFT  by  the  line  approach  each  dimension  would  have  to  be  of 
equal  size.  The  corresponding  case  is  a  P’*  x  P"  x  P"  DFT  computed  by  (P")^(l  + 
1/P  +  1/P^)  reduction  operations  and  a  like  number  of  P"  point  ID  DFT. 
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•  4D  — *  3D  ,  Prime  Case. 

Ni  X  Ni  X  P  x  P  4D  DFT  computed  by  jVj  x  iVj  x  P  3D  DFT, 
where  P  is  prime  and  Ni ,  N2  are  any  numbers. 

1.  Covering  cubes 

Ho  =  /^3((0,0,l,m)),  m  =  0,l,...,P-l 
=  P3((0, 0,0,1)). 

2.  Reduction  stage 

p-i 

x{di ,  d2,  da  -  im,  i),  m  =  0, 1, . . . ,  P  -  1 
1=0 
P-i 

^  ^  d2,  da) 

1=0 

d,  =0,l,...,iV,  -  1 
d2=0,l,...,iV2-l 
da  =  0,l,...,P-l 

3.  iV  X  iV  X  P  3D  DFT  stage 

-52,  ^“^3,  53m)  =  E'  e'  E  “S  a 

d\^0  <^2=0  (i3=0 

FCsi,  S2,  0,  S3)  =  E  e'  E 

(^2—0  ^3=0 


^(di4243) 
^{dl42  43)  - 


51  =0,  l,...,iVi  -1, 

52  =  0, 1, . . .  ,iV2  —  1,  m  =  0, 1, . . .  ,P  —  1 

53  =  0, 1,. ..  ,P  -  1 

The  computation  of  the  iVi  x  iVa  x  P  x  P  4D  DFT  by  this  method  requires  P  +  1 
reduction  operations  and  a  like  number  of  3D  Ni  x  N2  ><■  P  DFT.  Note  that  [33]  is  a 
source  of  O(PlogP)  prime  size  FFT  algorithms. 

To  compute  a  4D  DFT  by  the  line  approach  each  dimension  would  have  to  be  of 
equal  size.  The  corresponding  case  isaPxPxPxP  DFT  computed  21s  P^  +  P*  +  P+ 1 
reduction  operations  and  a  like  number  of  P  point  ID  DFT. 
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•  4D  — ►  3D  ,  Power  of  Prime  Case. 

iVi  X  iV2  X  P"  X  P"  4D  DFT  computed  by  Ni  x  A^2  x  P"  3D  DFT, 
where  P  is  any  prime  (including  2)  and  N\,N2  are  any  numbers. 


1.  Covering  cubes 


Wo  =  /f3((0,0,l,m)),  m=0,l,...,P"-l 

n,  =  P3((0,0,rP,l)),  r  =  0,l,...,P"-‘  -  1. 


2.  Reduction  stage 


a 


Ho 

(di ,d2,d3) 


a 


-Hi 

(dl ,d2,d3) 


pn_i 

^  x{di,  dj,  da  -  irn,  i), 

.=0 


pn_I 

^  T(di,  da,  I,  da  -  rPi), 

i=0 


m  =  0, 1, . . . ,  P"  —  1 
r  =0,l,...,P"-'-l 


di  =0,1,. -  1 
d2=0,l,...,iV2-l 
d3=0,l,...,P”-l 


3.  iVi  X  iVa  X  P"  3D  DFT  stage 

V'(3i,  Si,  sa,  .Sam) 

F(si,  52,  rPsj,  S3) 


Ni~l  N2-1  P"-! 

E  E  E 

(ijstO  (^3=0 
N2-I 

E  E  E 

di=0  (<2=0  (i3=0 


51  =  0, 1, . . . ,  iVi  —  1 

52  =  0, 1, . . . ,  iVa  —  1 

53  =  0,1,...,P”-1 


m  =  0,1,...,P"  -  1 
r  =  0,1,...,P"-*  -1 


The  computation  of  the  Ni  x  N2  x  P"  x  P"  4D  DFT  by  this  method  requires  P  +  P 
reduction  operations  and  a  like  number  of  3D  Ni  x  N2  x  P”  DFT. 

To  compute  a  4D  DFT  by  the  line  approach  each  dimension  would  have  to  be 
of  equal  size.  The  corresponding  case  is  a  P"  x  P"  x  P"  x  P"  DFT  computed  as 
^  1/P  +  1/P^  +  1/P^)  reduction  operations  and  a  like  number  of  P  point  ID 

DFT. 
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•  4D  — >  2D  ,  Prime  Case. 

N  X  P  X  P  X  P  4D  DFT  computed  hy  N  x  P  2D  DFT, 
where  P  is  prime  and  N  is  any  number. 

1.  Covering  planes 

Pto  =  ^2((0,  mi,m2  =  0, 1,...  ,P  -  1 

Hi  =  P2{(0,0,l,m2)) 

>i2  =  ^2((0, 0,0,1)) 


2.  Reduction  stage 


p-i  p-i 

^]dud2)  ~  x{di,  ^2  —  "^1*1  —  rn2i2,  i\,  ^2),  mi, m2  =  0, 1, . . .  ,P  —  1 

ii=0  >3=0 

P-1  P-1 

^(di,  ll,  6.2  —  m222,  *2) 

•  i  =0  13=0 
P-1  P-1 

^{di,d2)  ~  Hi  Hi  ^2) 

ij  =0  *3=0 


di  =  0,l,...,iV-l 
d2  =  0,l,...,P-l 

3.  2D  X  P  DFT  stage 

N-\  P-1 

F(si,  S2,  S2mi,  S2m2)  =  EE  ^(d\,d2)^N 

di  —0  (^3  =0 

;v-i  p-i 

F(Si,  0,  52,  52m2)  =  L  Z 

rf,=0rfj=0 

F(si,  0,  0,  S2)  =  E  E 


51  =  0,  i,...,Ar  - 1, 

52  =  0,  1,  .  .  .  ,  P 


mi,  m2  =  0, 1, . . . ,  P  —  1 


The  computation  of  the  N  x  P  x  P  x  P  4D  DFT  by  this  method  requires  P^  +  P  +  1 
reduction  operations  and  a  like  number  of  2D  N  x  P  DFT.  Note  that  [33]  is  a  source 
of  O(PlogP)  prime  size  FFT  algorithms. 

To  compute  a  4D  DFT  by  the  line  approach  each  dimension  would  have  to  be  of 
equal  size.  The  corresponding  case  isaPxPxPxP  DFT  computed  by  P^  +  P^  +  P  +  l 
reduction  operations  and  a  like  number  of  P  point  ID  DFT. 
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•  4D  — >  2D  ,  Power  of  Prime  Case. 

N  X  X  X  MD  DFT  computed  by  iV  x  P"  2D  DFT, 
where  P  is  any  prime  (including  2)  and  N  is  any  number. 

1.  Covering  planes 

Ho  =  H2{{0,  I, mi, m2)),  =  0, 1,. . .  ,P"  -  1 

Hi  =  H2{{0,  riP,  1,  m2))  r.  =  0, 1, . . . ,  P""'  -  1 
H2  =  P2((0,riP,r2P,l)) 

2.  Reduction  stage 

P"-!  P"-l 

H  ^  ^{d\,  d.2-miii-m2i2,ii,i2),  mj,  m2  =  0, 1, . . . ,  P"  -  1 
11=0  *2=0 
pn_i 

H  <^2 -nPii  -  m2i2,  12),  ri,r2  =  0, 1,...  ,P""‘ -  1 

tl=0  12=0 
pn_l  pn_i 

m  -^(^1’  *»’  *2,  «^2  -  riPi'i  -  r2Pi2) 

tl  =0  12=0 

d,  =0,l,...,iV'-l 
d2=0,l,...,P’‘-l 

3.  2D  A"  X  P"  DFT  stage 

yv-i  p”-! 

V{Si,  S2,  S2mi,  S2m2)  =  Z! 

(i]  =0  (^2=0 

V{si,  riPs2,  S2,  52^2)  =  Z  Z 

d,=0  rfj=0 

F(si,  riP52,  r2Ps2,  52)  =  ^  Z 

rf,=0  (i2=0 

Si  =  o,l,...,ivr  -  1, 

S2=0,1,...,P" 

The  computation  of  the  iV  x  P"  x  P"  x  P**  4D  DFT  by  this  method  requires  (P")*(l  + 
1/P  4-  1/P^)  reduction  operations  and  a  like  number  of  2D  iV  x  P”  DFT. 

To  compute  a  4D  DFT  by  the  line  approach  each  dimension  would  have  to  be 
of  equal  size.  The  corresponding  case  is  a  P"  x  P"  x  P”  x  P"  DFT  computed  by 
(pn)3(i/p2  ^  ^  reduction  operations  and  a  like  number  of  P”  point  ID  DFT. 
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Chapter  3 

An  Algorithm  for  the  Multiprocessor  Computation  of  MD  DFT 


3.1  Introduction 

This  chapter  describes  a  new  algorithm  for  the  parallel  computation  of  the  multidi¬ 
mensional  discrete  Fourier  transform.  The  features  of  the  algorithm  are  that  it  (1) 
eliminates  the  need  for  interprocessor  communication  and  (2)  that  it  maps  to  machines 
with  any  number  of  processors.  The  cost  of  eliminating  the  interprocessor  commu¬ 
nication  typical  for  such  algorithms  is  that  one  addition  must  be  performed  at  each 
processing  element  for  each  word  broadcast  to  the  machine.  The  algorithm  depends  on 
a  machine  model  that  supports  the  communication  functions  broadcast  and  report. 

The  methodology  proposed  in  this  chapter  is  based  on  the  fc— dimensional  hyper¬ 
plane  algorithm  previously  reviewed.  The  central  feature  of  that  algorithm  with  respect 
to  parallel  processing  is  that  it  does  not  intersperse  commimication  stages  with  pro¬ 
cessing  stages.  Two  basic  applications  of  the  hyperplane  algorithm  are  considered. 

The  first  is  a  direct  mapping  of  the  hyperplane  algorithm  to  the  multiprocessor.  It 
is  shown  that  such  a  mapping  can  be  made  when  the  number  of  processors  is  equed  to 
the  number  of  A:— dimensional  hyperplanes  required  to  cover  the  d— dimensional  array 
(k  <  d).  For  this  case  the  d— dimensionad  DFT  is  required  to  be  of  equad  and  prime, 
or  power  of  prime,  size  in  d  —  A:  -I-  1  (or  more)  dimensions.  The  degree  of  patrallelism 
of  the  algorithm  is  the  number  of  A:— dimensional  hyperplanes  in  a  covering  set.  The 
granularity  of  the  algorithm  is  a  single  Ar— dimensioned  DFT. 

The  major  benefit  of  the  direct  mapping  of  the  hyperplane  algorithm  is  that  it 
eliminates  the  requirement  for  interprocessor  commimication  for  efficient  MD  DFT 
computation.  The  major  limitation  of  this  method  are  the  restrictions  it  imposes  on 
the  machine  size.  There  are  only  d  —  1  possible  mappings  of  the  algorithm,  and  the 
degree  of  parallelism  changes  exponentially  between  mappings. 

This  limitation  is  addressed  by  an  alternative  mapping  of  the  hyperplane  algorithm. 
The  variant  developed  is  for  those  caises  where  the  degree  of  parallelism  of  the  machine 
is  not  matched  by  the  number  of  A'— dimensional  hyperplanes  required  to  cover  the 
d— dimensional  array.  For  these  cases,  the  midtidimensional  Cooley-Tukey  (MD  CT) 
algorithm  is  employed  together  with  the  hyperplane  algorithm.  The  role  of  the  MD  CT 
algorithm  is  to  factor  the  d— dimensional  DFT  into  stages  of  lower  order  d— dimensional 
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DFT.  The  hybrid  algorithm  that  results  has  a  degree  of  parallelism  equal  to  the  number 
of  it— dimensional  hyperplanes  required  to  cover  one  of  the  resulting  lower  order  arrays. 
This  hybrid  method  allows  great  flexibility  in  matching  the  degree  of  parallelism  of  the 
algorithm  to  the  size  of  the  target  processor. 

The  remainder  of  this  chapter  is  organized  as  follows:  First,  a  machine  model  is 
presented  for  a  generic  broadcast  mode  multiprocessor.  Then  the  direct  mapping  of  the 
hyperplane  algorithm  to  the  multiprocessor  machine  model  is  stated.  Detailed  cases  of 
the  algorithm  are  presented  for  the  2D  prime,  2D  power  of  prime  (including  2),  and 
3D  prime  cases.  The  hybrid  algorithm  is  introduced  in  chapter  4. 


Machine  Model 

The  algorithm  is  defined  with  respect  to  a  broadcast  mode  multiprocessor  machine 
model.  The  machine  is  taken  to  be  a  collection  of  processing  elements  (PEs)  together 
with  an  interconnection  network  for  interprocessor  communication.  The  machine  is 
externally  connected  to  a  host  by  a  single  I/O  channel.  All  data  enters  the  machine 
through  the  I/O  channel.  The  commtmication  functions  that  are  supported  must  in¬ 
clude  broadcast  and  report. 

1.  Broadcast:  This  function  downloads  data  from  the  I/O  channel  to  all  processing 
elements. 

2.  Report.  This  function  allows  a  distinguished  PE  to  upload  data  to  the  I/O  chan¬ 
nel. 

Comparative  Measures 

Throughout,  a  broadcast  m.ode  multiprocessor  mapping  of  the  row-column  algorithm 
is  used  for  comparative  purposes.  That  algorithm  is  shown  in  figure  3.1. 

By  the  algorithm  of  figure  3.1,  a  2D  DFT  task  is  seen  to  require  edternating  stages 
of  commimication  and  computation.  The  communication  stages  include  the  download 
of  input  data,  the  upload  of  transformed  data,  and  the  exchange  of  intermediate  results 
between  processors.  The  computation  stages  are  both  ID  iV— point  DFT. 


3.2  Algorithm  Definition 

The  mapping  of  the  fc— dimensional  hyperplme  algorithm  of  chapter  2  to  the  broadczist 
mode  multiprocessor  described  above  is  given  in  this  section.  The  hyperplame  algorithm 
is  computed  by  a  reduction  stage  followed  by  a  stage  of  A:— dimensioned  DFT.  The  prime 
case  of  the  algorithm  requires 

pd-k+i  _  I 
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1.  Broadcast  the  N  rows  of  x  uniformly  among  the  PEs;  each  PE  is  assigned  its  own 
set  of  rows. 

2.  In  parallel,  compute  the  N  independent  ID  FFT(N)  associated  with  the  rows. 

3.  Globally  transpose  the  intermediate  results  in  the  machine. 

4.  In  parallel,  compute  the  N  independent  ID  FFT(N)  associated  with  the  columns. 

5.  Upload  results.  The  transformed  data  is  stored  column-wise  in  the  machine.  A 
transpose  must  be  achieved  at  some  point  of  the  upload  process  if  data  is  to  be 
returned  in  row  major  form. 


Figure  3.1:  A  multiprocessor  implementation  of  the  row-column  algorithm. 


reduction  operations  a^'  given  by  expression  (2.20)  of  section  2.2.  The  power  of  prime 
case  has  more  redundancy,  and  requires 


reduction  operations  a^'  given  by  expression  (2.34)  of  section  2.3.  Throughout  the 
sequel,  let  M  denote  the  number  of  reduction  operations  of  the  algorithm  {ie.  number 
of  covering  A:— dimensional  hyperplanes).  For  both  cases  the  dimensional  DFT  sure 
computed  on  data  derived  from  the  input  using  only  additions.  They  produce  data  on 
hyperplanes  W,  of  the  output  array.  These  hyperplanes  axe  defined  by  theorems  2.2.1 
and  2.3.1  for  prime  and  power  of  prime  cases  respectively.  Conditions  for  implementing 
the  hyperplane  algorithm  on  a  multiprocessor  with  no  interprocessor  communication 
are  given  by  the  following  theorem  and  proof. 

Theorem  3.2.1  The  optimal  degree  of  parallelism  for  minimizing  interprocessor  com¬ 
munication  is  the  number  M  of  k  — dimensional  hyperplanes  required  to  cover  the  output 
array. 

Proof.  A  A:-dimensional  discrete  Fourier  transform  is  performed  on  the  fc— dimensioned 
array  produced  by  each  reduction  operation.  Hence,  no  interprocessor  communication 
is  required  from  input  to  output  if  the  granularity  of  the  reductions  a^'  is  no  finer  than 
that  to  put  all  d  €  ZfNi  x  •  •  •  x  ZJNk  points  of  an  aj*  in  a  single  processor.  Any  finer 
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degree  would  necessitate  interprocessor  communication  between  the  stage  in  which  the 
sununations  of  the  reduction  are  computed  and  the  stage  where  the  fc— dimensional 
discrete  Fourier  transforms  are  computed. 

□ 

Consider  also  that  the  reduction  stage  has  no  data  interdependencies  and  can  be 
computed  simultaneously  with  the  data  download.  Exploiting  this  concurrency  places 
a  lower  limit  on  the  partitioning  of  the  computation.  Parallel  machines  externally 
connected  to  an  I/O  channel  have  completion  times  bound  by  the  iV^  word  download 
and  word  upload.  Regardless  of  the  speed  of  addition,  or  the  number  of  PEs,  the 
download  time  of  words  limits  the  reduction  stage  completion  time.  Assuming 
granularity  no  finer  than  a  complete  reduction  operation,  an  L  PE  machine  reaches  the 
limit  if  add  time  is  [M/X]  faster  than  communication  time. 

All  cases  of  the  multiprocessor  hyperplane  algorithm  eu’e  defined  by  the  same  basic 
structure.  That  structure  is  outlined  in  figure  3.2  below. 


1.  Assign  the  reduction  operations  uniformly  among  the  processing  elements 
(PE). 

2.  During  the  download  (broadccist)  ot  the  input  data  to  the  machine,  each  PE  forms 
the  a^'  terms  assigned  to  it.  For  every  a^*  term  assigned  to  a  PE  only  1  addition 
is  pe^ormed  for  each  word  broadcast  to  the  PE.  When  the  input  download  is 
complete,  the  reduction  stage  of  the  computation  is  also  complete. 

3.  Each  PE  performs  a  single  fc— dimensional  DFT  to  finish  the  computation. 

4.  Upload  results  and  remove  redimdant  data. 


Figure  3.2:  Structure  of  the  multiprocessor  hyperplane  algorithm 

Using  the  hyperplane  algorithm,  a  MD  DFT  task  requires  stages  of  communication 
and  stages  of  computation.  The  communication  stages  include  only  the  data  download 
and  data  upload  operations.  The  computation  stages  are  the  DFT  and  data  reduction 
operations.  The  communication  and  computation  stages  eu'e  the  same  as  the  serial  and 
parallel  segments  of  the  algorithm.  That  is,  the  serial  portions  of  the  algorithm  are 
the  data  download  and  the  data  upload,  and  the  parallel  segments  of  the  algorithm  are 
the  DFT  and  reduction  operations.  Unlike  the  computation  of  the  DFT,  the  reduction 
operations  contain  no  data  interdependencies,  and  cein  be  computed  in  parallel  with 
the  download  operation.  On  a  machine  with  M  processors,  the  computational  speedup 
of  the  algorithm  relative  to  a  single  processor  computing  the  row-column  algorithm  is 

Speedup  =  —  (iV)'^“*  . 
k 
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This  assumes  the  additions  of  the  reduction  stage  are  computed  simultaneously  with 
the  data  download  and  incur  no  time  penalty  of  their  own.  In  the  remainder  of  this 
section  the  sdgorithm  is  stated  in  some  detail  for  the  2D  prime,  2D  power  of  prime,  and 
the  3D  prime  ceises. 

Prime  Case 

The  2D  prime  case  of  the  algorithm  is  developed  below.  For  this  czise  the  hyperplane 
algorithm  reduces  to  the  line  algorithm.  In  section  1.3,  the  prime  caise  of  the  line 
algorithm  was  shown  to  be  evaluated  by  the  following  two  step  procedure. 

1.  Reduction  stage.  Compute  the  P  +  1  summations: 

P-i 

x{i,  d  —  mi) 

1=0 

t=0 

2.  DFT  stage.  Compute  the  P  +  1  one-dimensional  P— point  DFT: 

V{mtJ)  =  m  =  0,l,...,P-l 

d=Q 

V{t,Q)  =  <  =  0,1,...,P-1 

d=0 

The  summation  terms  reduce  the  input  data  to  the  vectors  that  must  be  trans¬ 

formed  in  order  to  obtain  the  output  along  the  lines  L{{m,  1)).  The  summation  term 
gives  the  vector  that  must  be  transformed  to  obtain  the  output  along  the  line 
i((l,0)).  The  computation  of  the  P  x  P  line  algorithm  on  a  broadcast  mode  multi¬ 
processor  is  realized  by  the  procedure  of  figure  3.3. 
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1.  Assign  the  computation  of  the  reduction  operations  evenly  among  M  or 

fewer  PEs. 

2.  Broadcast  the  rows  of  input  data  to  the  PEs  and  simultaneously  compute  the 

reductions  at  the  PEs.  The  reduction  operations  are  computed  ais  follows; 

When  row  i  is  received,  the  PE  assigned  sums  the  elements  of  the  row  and 
places  that  sum  in  position  i  of  In  a  similar  maimer,  the  PE  assigned 

rotates  row  i  mi  positions  to  the  right  and  sums  it  componentwise  to  the 
other  received  rows.  In  this  fashion  will  exist  at  position  d  of  the  first 

row  received.  The  rotation  can  be  achieved  by  an  address  offset  and  modulo  P 
address  arithmetic. 

Note  that  the  partial  result  due  to  row  i  is  accumulated  as  row  i  is  received.  In 
this  way  each  PE  requires  only  0{N)  storage  for  each  it  is  assigned. 

3.  In  parallel,  each  PE  computes  a  P— point  ID  DFT  for  every  reduction 
assigned  to  it. 

4.  Upload  data  to  host.  There  is  some  redimdancy  in  the  data,  and  the  data  is 
permuted  among  the  PEs. 

Redundancy:  For  this  case,  the  redundancy  is  trivi^llly  that  output  point  U(0, 0) 
is  common  to  every  PE. 

Permutation:  The  permutation  is  that  every  PE  contaun  a  line  of  output  data  as 
specified  by  theorem  1.3.1.  The  row  0  and  column  0  are  in  the  PEs  which 
computed  the  and  terms,  and  axe  not  permuted.  Element  I  of  the 
vector  produced  by  the  PE  that  computed  is  the  colunm  I  and  row 

(iV  —  ml)  mod  P  element  of  the  2D  DFT. 


Figure  3.3:  2D  P  x  P  multiprocessor  mapping  of  the  line  algorithm. 
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To  obtain  a  measure  of  the  performance  of  the  algorithm,  allow  a  comparison  of 
a  multiprocessor  with  P  +  \  PEs  running  the  parallelized  line  algorithm  to  a  single 
processor  computing  the  row  column  algorithm.  Let  the  computational  burden  of 
the  reduction  stage  be  taken  simultaneously  with  the  data  download.  Each  reduction 
operation  for  the  2D  P  x  P  case  requires  O(P^)  additions.  Then  the  speedup  of  the 
parallel  line  algorithm  on  P  +  1  processors  relative  to  the  row-column  method  on  a 
single  processor  is 


Speedup 


2  .  p  .  FFTiP) 
PPT(P) 


=  2P. 


The  algorithm  achieves  a  2P  speedup  with  P  +  I  processors  rather  than  the  expected 
P  speedup  because  the  number  of  ID  DFT  has  been  reduced  from  2P  to  P  -f  1.  The 
justification  for  not  including  the  time  cost  of  the  additions  of  the  reduction  stage  is 
that  these  additions  are  concurrent  with  the  data  download  operation,  which  is  common 
to  both  algorithms  and  can  not  be  parallelized. 

The  performance  improvement  of  the  algorithm  depends  entirely  on  the  computa¬ 
tion  of  the  reduction  operations  in  parallel  with  the  data  download.  Consider  that  a 
single  processor  computing  just  the  reduction  stage  of  the  line  algorithm  would  require 
O(P^)  additions.  Below,  the  mapping  of  the  2D  power  of  prime  and  3D  prime  cases  of 
the  algorithm  are  considered. 


Power  of  Prime  Case 

This  section  gives  the  power  of  prime  case  of  the  algorithm  specialized  for  2"  x  2" 
2D  DFT.  The  main  differences  between  the  prime  and  power  of  prime  cases  are  the 
computation  of  the  reduction  stage,  emd  the  removed  of  redimdant  data.  The  power  of 
prime  case  of  the  line  algorithm  is  given  below. 

1.  Reduction  stage.  Compute  the  2"  -|- 2”~’  summations: 

=  J2x(t,d-mi) 

i=0 

^  x(d-2si,i) 
i=0 

2.  DFT  stage.  Compute  2"  +  2”“*  one-dimensional  2"— point  DFT: 

m  =  0,l,...,2"-l 

d=0 

»  =  0,1 . 2-'-l 

d=0 

f  =  0,1,...,2"  -1 


V{t,2st)  = 
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The  computation  of  the  terms  is  the  same  as  the  prime  case.  That  is,  when 

row  i  is  received,  the  PE  assigned  rotates  it  mi  positions  to  the  right,  and  adds 

component-wise  to  the  previously  received  rows.  This  compresses  the  data  so  that  the 
storage  requirement  for  this  stage  is  only  that  of  a  single  row. 

The  computation  of  the  terms  is  similar.  The  PE  assigned  rotates 

column  i  2si  positions  to  the  right,  and  adds  component-wise  to  the  other  columns. 
The  problem  this  introduces  is  that  the  data  is  downloaded  in  row  major  order.  The 
following  procedure  computes  the  terms  given  row-major  data. 

Let  Q;  be  the  maximum  such  that  2“l2s,  and  let  x{i,j)  denote  element  j  of  row  i.  As 
row  i  is  received,  the  PE  sums  all  elements  of  the  row  which  are  2"““  positions  apart. 
There  will  be  2"“"  such  sums  each  of  2“  elements.  These  can  be  expressed  eis 

20-1 

Ci.,  =  ^  x(i, /-h2"-“fc),  /  =  0,l,...,2"-“-l. 

k=o 


The  of  these  2”  “  sums  of  the  elements  of  row  i  are  accumulated  with  the  (i  -I- 
2s/)  mod  2”  element  of  Essentially,  the  c,,(  terms  of  any  are  computed 

by  striding  through  a  received  row  by  2"~"  and  forming  an  accumulated  sum  of  those 
points.  The  following  example  illustrates  this  procedure. 


Example:  Consider  a  4  x  4  2D  DFT.  The  reduction  operations  are  required 

for  s  =  0,1.'  Using  the  procedure  outlined  above,  the  reductions  are  formed  from 
downloaded  rows  in  the  following  manner. 

For  s  =  0,  the  maximum  a  such  that  2“|0  is  n.  The  PE  assigned  this  term  must  add 
elements  of  each  row  that  eire  1  apart;  cill  elements  of  each  row  received  axe  summed 
together.  The  term  is  the  accumulated  sum  of  row  d. 

For  5  =  1,  the  required  reduction  operation  is  The  maximum  a  such  that 

2°'|25  is  a  =  1.  As  row  i  is  received,  the  PE  sums  elements  which  are  2  positions  apart. 
There  will  be  2  such  sets  formed  for  each  row.  These  are: 


Co,0  =  3^0,0  +  Xo,2 
Cl,0  =  ^1.0  +  Xi.2 
C2,0  =  *2.0  +  *2.2 

C3.O  =  *3.0  +  *3.2 


Co.l  =  *0.1  +  *0,3 

Cl.l  =  *1,1  +  *1.3 

C2.I  =  *2,1  +  *2,3 

C3,l  =  *3,1  +  *3,3- 


After  row  i  is  received  and  the  c,j  are  formed,  the  elements  of  axe  given  as: 

(>.2)  , 

^0  —  ^0,0  +  C2,l 

(1.2) 

Oj  —  Ci,o  +  C3,i 

(1.2) 

O2  —  ^2,0  +  Co.l 

(1.2)  , 

<23  —  2^3.1  4-  Ci  j. 
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This  finishes  the  reduction  stage  of  the  algorithm.  The  ID  DFT  are  performed  on 
points 


a 


(i,y) 
d  ’ 


d  =  0,...,P-l 


to  complete  the  computation. 

□ 


As  in  the  prime  case,  when  the  computation  is  complete  there  are  some  output 
points  that  appear  in  more  than  one  processor.  The  procedure  of  figure  3.4  allows  the 
data  to  be  uploaded  with  the  redundant  points  removed.  In  the  procedure,  £((x,  y))  de¬ 
notes  the  line  of  output  computed  by  the  PE  that  was  assigned  the  reduction  operation 


1.  Upload  all  points  of  the  line  T((0, 1)). 

2.  For  each  a  =  0, 1, . . . ,  n  —  1,  define  j  =  2“, . . . ,  2“"^^  —  1  and  upload  all  points  on 
the  lines  T((j,  1))  except  multiples  of  2’*"“. 

3.  Upload  all  points  of  the  line  T((1,0))  except  (0,0). 

4.  For  each  a  =  0, 1, . . . ,  n  —  2,  define  k  =  2“, . . . ,  2“''’*  —  1  and  upload  eill  points  on 
the  lines  £((l,2fc))  except  multiples  of  2”“““*. 


Figure  3.4:  2"  x  2"  data  upload  procedure. 


Example:  Consider  a  4  x' 4  2D  DFT.  The  array  below  shows  the  effect  of  the  upload 
strategy  of  figure  3.4  on  the  set  of  covering  lines.  The  leftmost  column  of  the  array 
enumerates  the  lines  L{{x,y))  that  cover  the  output  array.  All  elements  of  the  vector 
listed  in  the  corresponding  row  of  the  array.  The  elements  that  are  not 
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uploaded  axe  boxed. 


I((0,1)) 

Vb.o 

Vo.i 

Vo.2 

Fo,3 

I((l,l))  1 

K,.oi 
_ 1 

Vi.i 

V2.2 

13,3 

L((2A))  I 

Vb.o 

V2A 

Fo,2 

^2,3 

L({3A)) 

V^,,o 

Fa.. 

F2,2 

1  F,,3 

I((1,0)) 

Vb.o 

F,.o 

F2.0 

Fs.o 

mm)  1 

Vb,o 

1  ^>.2  i 

F2,2 

1  F3.2 

Observe  that  all  the  elements  of  any  vector  (line)  not  uploaded  from  a  PE  are  multiples 
of  2  apart. 

□ 

Note  that  line  algorithm  required  6  ID  DFT  to  compute  the  DFT  of  the  example. 
This  compares  to  the  8  ID  DFT  required  by  the  row-column  algorithm.  Given  that  the 
additions  of  the  reduction  operations  are  performed  concurrently  with  the  input  data 
download,  a  6  PE  machine  could  compute  this  DFT  in  the  time  required  for  the  data 
upload  and  download,  plus  the  computation  time  for  one  ID  DFT.  This  represents 
a  speedup  in  the  DFT  computation  of  8  relative  to  a  single  processor  executing  the 
row-column  method. 

If  the  allowed  granularity  is  no  smaller  than  a  ID  DFT  the  maximum  degree  of 
parallelism  of  the  row-column  method  is  only  4,  even  though  8  ID  DFT  2ire  required. 
This  is  due  to  the  fact  that  the  ID  DFT  of  the  second  stage  (ie.  column)  are  dependent 
on  the  first  stage.  On  a  4  PE  machine,  the  row-column  method  would  require  time 
for  the  data  upload  and  download,  time  for  2  ID  DFT,  and  time  for  a  globed  data 
transpose  that  requires  every  PE  to  exchange  data  with  every  other  PE.  If  we  assume 
that  the  global  transpose  requires  zero  time,  then  the  speedup  of  this  method  relative 
to  a  single  processor  computing  the  row-column  method  is  4. 

The  parallelized  row-colunm  method  for  N  x  N  2D  DFT,  N  =  2",  gives  a  linear 
speedup  limited  by  N  (disregarding  transpose  time).  The  parallel  line  algorithm  attauns 
speedup  of  2N  given  3N/2  PEs.  This  ”super  linear"  performance  is  due  to  the  reduction 
of  the  number  of  ID  DFT  from  2N  to  3N/2  and  to  the  fact  that  the  additions  of  the 
reduction  stage  were  counted  with  the  communication  operations  of  the  data  downloaui. 
The  major  limitation  of  this  method  is  that  it  does  not  scale  to  machines  whose  degree 
of  parallelism  is  lower  than  3N/2.  This  constraint  will  be  eliminated  in  later  sections. 
Before  proceeding  to  that  work  the  3D  prime  caise  of  the  algorithm  is  discussed. 
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3D  Case 


The  general  mapping  of  the  3D  prime  case  of  the  hyperplane  algorithm  is  presented 
below.  The  primary  motivation  for  the  algorithm  is  that  it  exchanges  increased  com¬ 
putation  time  (larger  granularity)  for  a  decreased  degree  of  parallelism  relative  to  the 
parallel  line  method  (k=l). 

The  example  covered  in  this  section  is  the  3D  P  x  P  x  P  DFT,  where  P  is  a  prime 
number.  If  the  line  algorithm  were  apphed  (case  of  fc  =  1),  a  number  of  lines  equal  to 
P^  +  P  +  1  would  be  required  to  cover  the  output  array,  and  therefore  a  like  number 
of  processors  is  needed.  To  reduce  the  degree  of  parallelism  of  the  computation,  take 
k  =  2  and  compute  planes  of  the  output.  For  this  case  the  degree  of  parallelism  is  P-|- 1 
(number  of  planes  in  a  covering  set)  and  the  granularity  is  a  single  P  x  P  2D  DFT. 

In  the  previous  section  the  convention  that  data  was  input  to  the  machine  in  row- 
major  order  was  adopted.  In  order  to  remain  consistent  with  this  convention  it  is 
necessary  to  consider  a  different  set  of  covering  planes  that  those  given  in  theorem 
2.2.1.  Allow  the  covering  planes  to  be  given  by 

P2((0,m.l))  =  {{sx,  S2m,  S2)  :  Si,S2  6  Z/P], 

P2((0,l,0))  =  {(s„S2,0)  :  st,S2  G  Z/P},  (3.1) 

m  =  0, 1, . . . ,  P  —  1. 

Various  data  flows  can  be  obtained  by  permuting  the  order  of  the  components  of  the 
hyperplanes  of  theorems  2.2.1  and  2.3.1,  In  order  for  the  sets  to  maintain  their  covering 
property  each  hyperplane  must  be  permuted  in  the  same  way. 

For  the  hyperplanes  of  (3.1)  the  algorithm  is  given  by  the  following  procedure: 


1.  Reduction  stage.  Compute  the  P 1  summations: 


W((0.m,l)) 
«(/,  Ml 


W((0.t.0)) 

Ml 


P-1 

x{dx,i,d2  -  mi) 

i=0 


P-1 

X]  ■^id\,d2,i) 

»=0 


2.  DFT  stage.  Compute  the  P  -t-  1  2D  DFT: 


d\  =0  (ij  =0 

di=0di=0 


m  =  0, 1, . . . .  P  —  1  and  .si,  S2  =  0, 1, . . . ,  P  —  1. 
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The  computation  of  the  3D  prime  algorithm  on  a  multiprocessor  is  essentially  the 
same  as  the  2D  case  given  in  figure  3.3.  They  key  differences  between  them  are  the 
evaluation  of  the  reduction  operations  and  the  removal  of  the  redundancy  in  the  data 
upload.  These  are  specified  below. 

Assign  the  P+1  reduction  operations  to  the  PEs.  During  the  download  operation 
the  rows  of  input  data  are  broadcast  to  the  machine.  Let  row  (ri ,  r2)  denote  the  P  input 
points  X(r,,r2,o)i  •  •  •  i  .rj ,p- 1 )  of  the  input  array.  The  PE  assigned  the  computation  of 

//((O.l.O)) 

forms  an  accumulated  sum  of  the  elements  of  row  (rj,  r2)  when  it  is  received.  That  sum 
is  placed  in  location  (ri,r2)  of  The  computation  of  this  term  requires  one 

addition  for  each  word  input  to  the  PE,  and  is  completed  when  the  last  input  enters 
the  PE. 

The  PEs  assigned  the  computation  of  the  a^do.m.i))  temis  compute  as  follows:  When 
row  (ri,  r2)  of  the  input  is  received  it  is  rotated  mr2  positions  to  the  right  and  sxunmed 
componentwise  with  the  elements  of  row  r|  of  The  rotation  is  achieved  by 

address  offset  and  modulo  N  address  arithmetic.  Similar  to  the  ID  case,  the  summation 
operation  compresses  the  data.  Therefore  the  reduction  stage  requires  the  PE  to  have 
storage  for  only  P^  points  for  each  reduction  operation  assigned  to  it. 

After  the  reduction  stage  is  finished,  the  2D  DFT  are  performed  on  the  in  each 
PE.  This  completes  the  computation.  The  output  is  distributed  among  the  PEs  as 
hyperplanes.  There  are  output  points  that  were  computed  in  more  than  one  PE.  It 
is  desirable  that  these  be  removed  before  the  results  are  uploaded.  The  redundant 
points  axe  all  on  the  the  line  L{{  1, 0, 0))  of  the  output  array.  In  order  to  eliminate  these 
points  during  the  data  upload,  only  one  PE  is  required  to  upload  the  points  on  the  line 
£((1,0))  in  its  2D  output  array.  That  is,  only  one  PE  uploads  the  tremsformed  points 

“0,0>  “l,0)  •  •  •  1 

For  the  3D  prime  case  the  multiprocessor  hyperplane  algorithm  requires  a  degree 
of  parallelism  equal  to  P  +  1  and  the  granularity  of  a  single  2D  P  x  P  DFT.  Therefore, 
the  computational  speedup  relative  to  a  single  processor  computing  the  row-colunan 
method  is 

^  j  3  •  .  FFT{P)  3P 

Spee  up-  2.p.FFT{P)  ~  2  ’ 

This  does  not  accoimt  for  the  data  upload  and  download  or  the  additions  of  the  reduc¬ 
tion  operations  that  are  performed  simultaneously  with  the  download. 

For  comparative  purposes,  consider  a  P  processor  machine  computing  the  row- 
column  like  procedure  of  figure  3.5.  The  modified  row-column  algorithm  of  figure  3.5 
requires  communication  and  computation  stages.  The  communication  stages  include 
the  upload  and  download  of  data  points,  and  the  exchange  on  intermediate  results 
(global  transpose).  The  computation  stages  are  ID  and  2D  DFT.  If  the  2D  DFT 
at  each  PE  are  computed  by  the  row-column  method,  then  this  algorithm  is  seen  to 
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1.  Broadcast  two  dimensions  of  data  to  each  PE. 

2.  Compute  2D  P  x  P  DFT  at  each  PE. 

3.  Globally  transpose  the  data  in  the  machine. 

4.  Compute  ID  DFT  on  the  remaining  dimension. 

5.  Upload  the  data. 

Figure  3.5:  A  modified  row-column  3D  algorithm. 


require  the  time  to  compute  3P  ID  DFT.  Its  computational  speedup  relative  to  a  single 
processor  computing  the  row-colinnn  method  is 

''  3PFFT(P) 

The  P  processor  implementation  of  the  row-column  like  procedure  of  figure  3.5  has 
a  speedup  of  P,  but  the  P  -f  1  processor  implementation  of  the  hyperplane  algorithm 
has  a  speedup  of  3P/2.  The  hyperplane  algorithm  achieves  a  50%  improvement  over 
the  row-coliunn  method  at  the  cost  of  only  a  single  processor.  The  advantage  of  the 
hyperplane  method  is  due  to  its  overall  reduced  number  of  DFT  relative  to  the  row- 
column  method. 

Conclusion 

This  section  demonstrated  a  multiprocessor  algorithm  for  MD  DFT  computation.  The 
algorithm  is  based  on  a  direct  mapping  of  the  hyperplane  algorithm  to  the  multipro¬ 
cessor  architecture.  The  algorithm  is  shown  to  naturally  partition  the  d— dimensional 
DFT  into  M  independent  computations,  M  equal  to  the  number  of  A:— dimensional 
hyperplanes  required  to  cover  the  d— dimensional  array. 

For  multiprocessor  architectures  a  mapping  is  given  that  requires  no  interprocessor 
communication,  and  allows  the  M  independent  computations  to  occur  concurrently 
with  the  input  download.  On  single  I/O  channel  machines  that  eire  capable  of  exploiting 
the  full  degree  of  parallelism  of  the  algorithm,  execution  times  as  low  as  the  time  to 
compute  a  single  one  dimensioned  FFT  on  N  points,  plus  the  time  to  upload  and 
download  the  data  are  attainable.  If  task  level  pipelining  is  used,  average  times  equal 
to  the  single  channel  I/O  limit  occur,  but  single  task  completion  times  are  longer. 
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Furthermore,  if  the  input  data  is  real,  the  only  stage  of  the  algorithm  that  inputs 
complex  data  is  the  data  upload. 

The  primauT^  benefit  of  the  algorithm  was  seen  to  be  that  it  requires  no  interprocessor 
commimication.  The  primary  limitation  of  the  algorithm  is  that  it  does  not  scale 
flexibly  to  the  degree  of  parallelism  of  the  target  processor.  This  issue  is  addressed  in 
the  next  section. 
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Chapter  4 

A  Hybrid  Algorithm  for  the  Multiprocessor  Computation  of  MD  DFT 


This  chapter  combines  the  hyperplane  and  multidimensional  Cooley-Tukey  algorithms 
to  produce  a  highly  scalable  broadcast  mode  multiprocessor  algorithm.  The  rnmn 
features  of  the  algorithm  are  that  it  requires  no  interprocessor  commimication,  and 
that  it  maps  to  machines  of  varying  degrees  of  parjdlelism.  The  cost  of  eliminating 
interprocessor  communication  by  this  method  is  that  one  addition  must  be  performed 
at  each  processing  element  for  each  input  broadcast  to  the  machine. 


Overview 

In  previous  sections  the  fc— dimensional  hyperplane  2ilgorithm  was  mapped  to  a  broad¬ 
cast  mode  multiprocessor  machine  model.  The  parallel  algorithm  that  resulted  requires 
no  interprocessor  communication.  However,  in  order  to  exploit  that  benefit,  the  num¬ 
ber  of  processors  in  the  machine  has  to  equal  the  number  of  fc— dimensional  hyperplanes 
required  to  cover  the  d— dimensioned  output  array.  For  meiny  cases  this  can  be  a  large 
number.  Consider  that  the  power  of  prime  case  (including  2)  of  the  2dgorithm  requires 
the  degree  of  parallelism  of  the  machine  to  be 

(¥)"(^)- 

The  granularity  of  the  corresponding  computation  is  a  single 
dimensional  DFT  of  size  (P")*. 

For  a  given  problem  size  the  only  means  available  for  adjusting  the  degree  of  peu-- 
allelism  is  to  alter  k.  The  effect  this  hats  on  the  adgorithm  is  to  change  the  dimension 
of  the  hyperplanes  used  to  cover  the  output  array.  An  increase  in  k  increases  the  grain- 
ularity  of  the  computation  performed  at  each  PE  by  increasing  the  dimension  of  the 
required  DFTs.  This  increaise  in  granularity  is  accompanied  by  am  associated  decrease 
in  the  degree  of  parallelism  required  by  the  algorithm. 

Using  this  method,  the  tradeoff  between  granularity  and  degree  of  pairadlelism  is 
limited  to  powers  of  P".  That  is,  for  an  increase  in  dimension  of  1,  the  size  of  the 
computation  increases  P"  times  and  the  degree  of  pairadlelism  decreases  roughly  P" 
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times.  At  the  limit  k  =  d  —  I,  this  method  allows  a  minimum  degree  of  parallelism  of 


The  corresponding  granularity  of  the  computation  performed  at  each  PE  is  a 
DFT.  For  the  iV  =  2"  case,  the  minimum  degree  of  parallelism  is  3jV/2,  and  the 
corresponding  granularity  is  a  single  d  —  1  dimension  DFT  of  size  {NY~^  . 

The  primary  limitations  that  result  from  manipulating  only  the  dimensionality  of 
the  problem  are:  (1)  that  the  degree  of  parallelism  can  only  be  scaled  by  powers  of  P”, 
and  (2)  that  there  axe  only  d  —  1  possible  mappings  of  the  algorithm. 

One  method  for  obtaining  increaised  control  over  the  degree  of  parallelism  of  the 
algorithm  is  to  modify  the  order  (size)  of  the  computation  in  each  dimension.  This 
can  be  done  using  multidimensional  Cooley-Tukey  (MD  CT)  /  vector-radix  techniques. 
The  feature  of  the  MD  CT  algorithms  central  to  this  work  is  that  they  can  factor  a 
d— dimensional  DFT  into  stages  of  lower  order  d— dimensional  DFT. 

In  the  next  section  an  algorithm  that  marries  the  hyperplane  algorithm  to  the 
MD  CT  algorithm  is  introduced.  The  resulting  hybrid  algorithm  uses  the  MD  CT 
algorithm  to  reduce  the  order  of  the  computations  performed  in  each  dimension,  and 
the  hyperplane  algorithm  to  reduce  the  number  of  dimensions.  The  hybrid  algorithm 
requires  no  interprocessor  communication,  and  can  be  scaled  to  match  the  degree  of 
parallelism  of  the  target  machine. 

4. 1  Derivation 

The  hybrid  algorithm  will  be  developed  as  follows:  First  the  structure  of  the  MD  CT 
factorization  is  reviewed.  Then  a  nesting  of  the  hyperpleme  algorithm  within  the  MD 
CT  factorization  is  given  that  permits  the  MD  DFT  to  be  computed  with  a  reduced 
degree  of  paxallelism  and  no  interprocessor  communication. 

MD  CT  Components 

The  structure  of  the  MD  CT  (vector-radix)  factorization  of  the  2D  nxn  DFT  where  n  = 
rs  is  given  by  the  block  diagram  in  figure  4.1.  The  diagram  describes  the  relationship 
between  the  first  and  second  DFT  stages  of  a  MD  CT  factorization  for  n  =  rs.  In 
the  diagram,  each  DFT  of  a  stage  is  identified  by  a  rectangle  with  an  index  (x,j/) 
affixed  to  it.  DFT  i  of  stage  two  refers  to  the  DFT  whose  index  {x,y)  is  the  i"*  taken 
lexicographically.  The  diagram  shows  graphically  that  input  j  to  DFT  i  of  stage  two 
is  from  output  i  of  DFT  j  of  stage  one. 

The  computation  and  data  flow  of  4.1  are  defined  in  terms  of  the  decomposition 
of  the  input  and  output  index  sets.  The  procedure  below  defines  the  computation  in 
terms  of  the  coset  decomposition  of  the  index  set. 
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Stage  2 


Stage  1 


Figure  4.1:  Flow  diagram  of  an  n  x  n  2D  FFT  factorization  for  n  =  rs. 

1.  Computation  of  d— dimensional  DFTs  Fr-  Each  of  these  is  performed  on  a  coset 
of  the  input  with  respect  to  the  subgroup  sZ/n. 

2.  Twiddle  multiplications.  The  element  a  of  the  input  coset  u  that  was  transformed 
in  step  (1)  is  multiplied  by 

3.  Computation  of  d— dimensional  DFTs  F,.  Each  of  these  produces  a  coset  of 
output  with  respect  to  the  subgroup  rZ/n.  The  input  and  output  subgroups  are 
du2Ll. 

This  procedure  relates  directly  to  the  diagram  of  figure  4.1.  The  DFTs  of  stage 
one  of  the  figure  correspond  to  the  F-  of  step  one  of  the  procedure,  and  the  DFTs  of 
stage  two  of  the  figure  correspond  to  step  three.  Step  two  of  the  procedure  defines  the 
twiddle  multiplications.  The  diagram  shows  the  input  points  to  each  DFT  of  stage  one. 
The  DFT  labeled  (0,0)  is  seen  to  be  computed  on  the  points  whose  indexes  are  given 
by  the  set 

{(i,t/)  :  (x,y)  e  sZ/n  X  sZ/n}. 
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Similarly,  the  DFT  labeled  (0, 1)  is  computed  on  the  points  whose  indexes  are  in  the 
set 

+  (0, 1)  :  (x,yy  e  sZjn  X  sZI,.\. 

That  is,  the  input  data  for  the  DFT  labeled  (0,0)  are  the  points  whose  indexes 
are  the  subgroup  sZ/n  of  Z/n,  and  the  input  data  to  the  DFT  labeled  (0, 1)  are  the 
points  whose  indexes  are  the  (0, 1)  coset  of  the  decomposition  of  Z/n  with  respect  to 
the  subgroup  sZ/n.  In  general  the  input  to  the  DFT  labeled  (j,i/)  is  coset  (x,y)  of 
the  decomposition.  The  cosets  of  the  decomposition  of  Z/n  with  respect  to  sZ/n  can 
be  enumerated  by  u  G  Z/s  x  Z/s.  In  this  way  number  u  is  said  to  transform  coset 
u  of  the  input  array. 

The  second  stage  of  DFT  can  be  described  in  a  similar  way.  The  difference  between 
the  stages  is  that  the  DFTs  of  stage  two  each  produce  a  coset  of  the  output  array  with 
respect  to  the  subgroup  rZ/n  x  rZ/n.  The  cosets  in  this  decomposition  are  enumerated 
by  a  G  Z/r  X  Z/r.  In  this  way  F,  number  a  produces  the  coset  a  of  the  output  array. 
These  are  the  output  points  whose  indexes  are  given  by 

{(x,y)  +  (n,,<i2)  :  (•r,y)  €  rZ/n  x  rZ/n}. 

Above,  the  input  and  output  data  flow  has  been  described  in  terms  of  the  decomposition 
of  the  input  and  output  index  sets.  Below,  the  data  flow  between  DFT  stages  is 
described  in  terms  of  those  same  decompositions. 

The  output  of  each  Fr  of  stage  one  is  associated  with  the  index  set  a  G  Z/r  x  Z/r. 
These  correspond  to  the  indexes  marked  on  the  left  column  of  the  stage  one  DFT  in 
the  diagram.  Similarly,  the  input  to  each  DFT  of  stage  two  is  associated  with  the  index 
set  u  G  Z/s  x  Z/s.  These  correspond  to  the  indexes  meirked  on  the  right  column  of  the 
DFT  of  stage  two.  Using  these  associations,  the  data  flow  between  the  DFT  stages  can 
be  described  in  terms  of  the  decompositions  of  the  input  eind  output  index  sets.  Recall 
from  the  discussion  above  that  an  F,  of  stage  two  is  labeled  by  a  G  Z/r,  and  m  Fr 
is  labeled  by  u  G  Z/s.  Then  the  F,  associated  with  coset  a  of  the  output  array  takes 
its  input  u  from  the  output  a  of  the  Fr  associated  with  input  coset  u.  This  relation  is 
depicted  graphically  below. 


u 


Evaluating  the  MD  DFT  by  the  factorization  of  (4.1)  requires  computing  DFT 
stages  composed  of  Fr  and  F,.  The  inputs  to  the  F,  were  shown  to  be  the  collection 
of  outputs  of  every  Fr.  Let  that  F,  be  assigned  to  a  PE  in  a  multiprocessor.  The 
PE  can  evaluate  that  F,  if  the  i"*  point  from  every  stage  one  Fr  is  available  at  the  PE. 
Below,  an  assignment  of  the  hyperplane  algorithm  to  the  DFTs  of  stage  one  is  given 
that  satisfies  this  criterion. 
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Hyperplasie  Components 

The  nmltiprocestior  hypeiplaiie  algorithm  developed  in  the  previous  sections  of  this 
chapter  will  be  applied  to  the  computation  of  the  DFTs  of  the  first  stage  of  the  MD  CT 
factorization.  Before  proceeding  with  the  development  of  the  algorithm,  first  consider 
the  computation  of  a  single  Fr  by  that  method. 

Assume  that  F,  is  of  the  same  size  in  d  —  -1-  1  dimensions,  and  let  =  •  •  •  =  rj. 

Then  the  indexing  set  of  F-  may  be  written 


C  =  Z/r  =  Z/ri  X  •  •  •  y.  Zlvd  =  Ay.  B, 
A  =  Z/ri  y  y  Z/vk-i  x  Z/r*,, 

B  =  (Z/rfc)'^-^ 


The  number  of  A:— dimensional  hyperplanes  required  to  cover  the 

d— dimensional  array  C  =  Z/r  is  denoted  M(r).  Let  the  number  of  PEs  in  the  machine 
be  M(r),  and  recall  that  on  such  a  machine  the  DFT  F  may  be  computed  by  the 
procedure  listed  below. 


1.  Assign  one  hyperplane  of  the  covering  set  to  each  PE. 

2.  Broadcast  the  inputs  to  the  machine,  and  simultaneously  compute  the  reduction 
operations. 

3.  Compute  a  single  A:— dimensional  DFT  over  A. 


Figure  4.2:  Computation  of  at  a  PE. 


After  this  procedure  is  complete  each  PE  has  its  own  A:— dimensional  hyperplane 
of  the  outputs  of  the  Fr.  For  each  of  the  remmning  Fr  of  stage  one,  apply  the  same 
assignment  of  hyperplanes  to  PEs  that  was  used  for  the  first  Fr-  These  DFTs  are 
computed  in  the  same  maimer  as  the  first. 

When  the  computation  of  this  stage  is  complete,  every  PE  has  one  A:— dimensional 
hyperplane  from  each  Fr  of  stage  one.  Since  each  PE  was  assigned  the  S2une  hyperplane 
in  every  Fr ,  the  PE  that  has  point  a  from  the  first  Fr  has  point  a  from  every  Fr  of  stage 
one.  In  the  previous  section  these  were  shown  to  be  the  points  needed  to  compute  coset 
a  of  the  output  partition.  The  PE  generates  coset  a  of  the  output  partition  by  applying 
the  appropriate  twiddle  multiplications  to  the  stage  one  outputs,  then  computing  a 
single  d— dimensional  DFT  F,. 

Below  an  example  is  given  to  illustrate  this  procedure.  The  example  is  for  a  9  x  9 
2D  DFT  computed  on  a  four  PE  broadcaist  mode  multiprocessor. 
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Example:  Consider  the  problem  of  computing  a  9  x  9  2D  DFT  on  a  4  processor  machine. 
A  direct  mapping  of  the  line  algorithm  (Ar  =  1  case  of  the  hyperplane  algorithm)  for 
this  computation  would  require  12  processors.  However,  the  hybrid  algorithm  using  a 
3x3  factorization  requires  only  4  processors.  This  algorithm  will  be  developed  below. 

The  computation  depends  on  the  eveduation  of  the  3x3  2D  DFT  F3  3,  by  the  line 
algorithm.  That  computation  is 


1  =  F’a.a 


and  is  described  by  the  example  at  the  end  of  section  1.3.  The  output  array  of  the 
3x3  2D  DFT  is  given  by  four  lines.  These  are; 


LiiUl)) 

^((1,0)) 


{^0,0 »  <^0,1  ,  -0,2} 
{^0,0  >  ,  Z2a] 

{20,0 »  22,1 ,  21,2} 
{20,0  1  2i,0  ,  22,0} 


Each  of  these  lines  is  assigned  to  a  different  processor.  Let  the  output  on  the  lines 
£r((m,  1)),  m  =  0,1,2  be  computed  by  PE^,  and  the  output  on  the  line  i)((l,0))  be 
computed  by  PE3.  As  the  inputs  are  broadcast  to  the  machine,  each  PE  computes 
the  reduction  operation  eissociated  with  its  2issigned  line  of  the  output.  After  the 
broadcast  is  complete,  each  PE  performs  a  3— point  DFT  to  produce  its  set  of  output 
points.  Each  of  the  3  x  3  2D  DFT  of  the  first  DFT  stage  of  the  factorization  are 
computed  in  this  way. 

The  9  X  9  2D  DFT  factored  into  stages  of  3  x  3  DFT  is  given  by  [14] 


Stage  2 

^  ““  ^  '  N 

r(a, +361,02  +  362)  =  Y.  (4-1) 

uj=0  U2=0 

•  Y  H  +  “l’3p2  +  U2V3‘’’‘t^3"'^ 

Pl=0p2=0 

' - - - / 

Stage  1 


The  first  stage  of  DFT  of  this  factorization  requires  the  computation  of  nine  2D  3  x  3 
DFT.  The  inputs  to  these  DFT  are  listed  in  the  eurrays  below.  Each  of  the  arrays  is 
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labeled  above  by  the  index  of  its  first  element  in  parenthesis. 


This  partition  of  the  input  elements  results  from  decimating  both  dimensions  of  the 
array  simultaneously.  Each  array  of  the  partition  is  a  coset  of  Z/9  x  Z/9  with  respect  to 
the  subgroup  3Z/9  x  3Z/9.  The  label  above  each  axray  is  its  cosr t  leader.  Throughout 
this  example,  the  arrays  will  be  defined  and  referred  to  as  cosets,  and  distinguished  by 
their  coset  leaders. 

Each  of  the  nine  required  3  x  3  2D  DFT  of  stage  one  are  computed  by  the  multipro¬ 
cessor  line  algorithm  as  described  above.  The  lines  cire  taken  over  each  of  the  cosets. 
The  assignment  of  lines  to  PEs  is  the  same  for  each  of  the  cosets.  The  PE  assigned  line 
T((0, 1))  in  coset  (0,0)  is  assigned  line  L((0, 1))  in  every  coset. 

For  simplicity  of  presentation,  allow  the  input  data  to  be  broadcast  to  the  PEs 
by  coset.  Then  the  reduction  operations  are  performed  simulteineously  with  the  data 
download.  Each  PE  performs  one  addition  for  each  word  it  inputs.  As  all  the  points 
of  a  coset  axe  received,  the  associated  reduction  operation  is  completed.  When  the 
download  is  finished  the  PE  performs  nine  independent  3— point  ID  DFT  to  complete 
stage  one. 

Consider  the  computation  of  the  line  T((0, 1))  in  each  coset  of  the  input  partition. 
For  each  coset  downloaded  the  PE  computes  the  required  reduction  operation.  When 
the  download  is  complete,  the  PE  performs  the  ID  DFT.  At  this  time  the  PE  contains 
all  the  stage  one  outputs  corresponding  to  the  first  row  of  each  coset  shown  above 
(enclosed  in  rectangles).  The  appropriate  twiddle  multiplications  are  applied  at  this 
time. 

The  second  stage  of  DFTs  requires  computing  nine  3  x  3  2D  DFT.  To  obtain  all 
output  points  by  the  method  described  in  this  section,  each  PE  must  compute  three 
3  X  3  2D  DFT.  For  the  PE  aissigned  line  T((0, 1))  of  each  stage  one  coset,  one  2D  DFT 
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is  computed  for  each  element  in  the  line.  That  is,  the  first  2D  DFT  is  performed  on  all 
points  enclosed  in  squares,  the  second  is  computed  on  all  points  enclosed  in  triangles, 
and  the  third  2D  DFT  is  computed  on  all  points  enclosed  in  circles.  .After  this  stage  of 
2D  DFT  is  complete,  the  entire  computation  is  distributed  by  output  coset  throughout 
the  machine.  This  permutation  is  similar  to  the  permutation  that  results  from  a  typical 
application  of  the  vector-radix  algorithm. 

□ 

The  factorization  of  the  9  x  9  2D  DFT  given  in  the  example  above  contains  18  Fa  s 
2D  DFT.  If  each  of  these  were  evaluated  by  the  row-column  method  there  would  be 
108  FFTs  of  3— points  each.  Assume  the  reduction  operations  of  the  line  algorithm  are 
performed  concurrently  with  the  input  download  and  the  communication  time  is  not 
considered,  then  the  computational  speedup  of  the  hybrid  algorithm  of  the  example 
relative  to  a  single  processor  computing  the  3x3  factorization  is 


Speedup  = 


108FFT(3) 

27FFT(3) 


which  is  the  ideal  linear  speedup,  and  requires  no  interprocessor  communication. 

The  multiprocessor  computation  of  F„  was  described  in  previous  sections  of  this 
chapter.  The  procedure  that  follows  describes  the  multiprocessor  computation  of  F„ 
factored  into  stages  of  F,  and  Fr. 

1.  Associate  one  A;— dimensional  hyperplane  of  Z/r  with  each  PE.  The  required 
hyperplanes  are  given  by  theorems  2.2.1  and  2.3.1.  A  hyperplane  represents  a 
subset  of  the  output  of  a  single  F,. 

2.  The  input  is  partitioned  into  cosets  with  respect  to  the  subgroup  sZ/n.  These 
are  enumerated  by  u  €  Zjs.  Each  corresponds  to  ein  array  Z/r.  Assume  for 
simplicity  of  presentation  that  the  input  has  been  ordered  by  coset. 

3.  Broadcast  the  input  to  the  machine.  As  coset  u  is  input,  each  PE  computes  the 
reduction  operation  corresponding  to  the  hyperplane  assignment  of  step  (1). 
For  each  coset,  this  step  is  identic2d  to  the  multiprocessor  computation  of  a  single 
Fr.  When  this  step  is  complete,  s  independent  reduction  operations  have  been 
computed  by  each  PE. 

4.  Compute  a  number  s  of  A:— dimensional  DFT  at  each  PE.  Each  A:— dimensional 
DFT  is  computed  on  the  output  of  a  reduction  operation  from  step  (3).  Each 
produces  the  hyperplane  of  the  coset  associated  with  the  PE  in  step  (1). 


5.  Apply  the  twiddle  multiplications. 

6.  Compute  independent  DFTs  F,  at  each  PE.  There  are  s  A:— dimensional  hyper¬ 
planes  in  each  PE.  Each  hyperplane  heis  |,4|  points.  (>l|  DFTs  axe  performed.  For 
X  G  >4,  a  DFT  F,  is  performed  on  the  collection  of  all  points  with  index  x  from 
each  hyperplane  in  the  PE.  After  this  step  the  computation  is  complete. 
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7.  Data  upload.  The  output  is  distributed  among  the  PEs  according  to  the  coset 
decomposition  of  Zjn  with  respect  to  the  subgroup  rZfn.  This  is  a  normal 
vector-radix  [14]  permutation.  The  cosets  are  enumerated  hy  a  ^  Z lr_.  \i  Z It_ 
is  contained  in  the  A:— dimensional  hyperplane  associated  with  the  PE  in  step  (1), 
then  output  coset  a  is  also  contained  in  the  PE.  Cosets  of  output  are  redundant 
exactly  the  same  way  that  points  are  redundant  in  the  computation  of  a  single  F,. 
When  it  is  desirable,  the  redundancy  can  be  eliminated  in  an  analogous  manner. 

4.2  Conclusion 

This  chapter  presented  a  new  algorithm  for  the  multiprocessor  computation  of  the 
MD  DFT.  The  algorithm  is  based  on  the  paralellizacion  of  the  hyperplane  algorithm 
introduced  in  chapter  2.  It  is  intended  for  multiprocessors  connected  to  a  single  I/O 
channel.  The  target  processor  must  support  the  communication  functions  broadcast 
and  report. 

Implementations  of  the  algorithm  are  considered  for  two  basic  situations.  The 
first  is  when  the  degree  of  parallelism  of  the  target  processor  matches  one  of  Mk(n), 
k  =  1, . . . ,  d—  1.  Where  Mk(n)  is  the  number  of  A:— dimensionail  hyperplanes  required  to 
cover  Z/n.  For  these  cases  the  algorithm  is  shown  to  require  the  time  to  compute  one 
A:— dimensional  DFT  plus  the  time  to  input  and  output  the  data.  Computationally,  the 
speedup  of  the  algorithm  is  the  ratio  of  a  d— dimensional  DFT  to  a  single  Ar— dimensional 
DFT. 

The  second  case  of  <,he  algorithm  applies  when  the  size  of  the  machine  does  not 
match  Mfe(n).  That  is,  the  degree  of  parallelism  of  the  algorithm  is  not  compatible 
with  the  size  of  the  machine.  For  this  case  the  hyperpleine  algorithm  is  married  to  the 
multidimensional  Cooley-Tukey  algorithm  (MD  CT).  The  role  of  the  MD  CT  edgorithm 
in  the  hybrid  algorithm  is  to  factor  F^  into  stages  of  F,  and  Fr,  where  each  n,  is 
the  composite  n,  =  r,s,.  For  this  case  the  degree  of  parallelism  is  one  of  Mk{r),  for 
A:  =  l,...,d— 1.  In  this  manner  the  degree  of  parallelism  of  the  computation  can  be 
modified  to  match  the  size  of  the  target  processor.  Like  the  direct  method,  the  hybrid 
algorithm  requires  no  interprocessor  communication. 
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Chapter  5 

Program  Results 


This  research  program  developed  a  new  algorithm  for  the  multiprocessor  computation 
of  d— dimensional  discrete  Fourier  transform.  The  main  features  of  the  algorithm  are 
that  it  requires  no  interprocessor  communication  and  that  it  is  highly  scalable.  The 
motivation  for  this  research  program,  and  the  elements  that  were  required  to  realize  it 
are  described  below. 

This  research  program  has  been  driven  by  the  need  for  new  algorithmic  meth¬ 
ods  that  can  exploit  the  power  of  VLSI  based  multiprocessors.  These  machines  have 
large  computational  power  but  limited  communication  bandwidth.  They  strongly  fa¬ 
vor  algorithms  that  minimize  interprocessor  communication.  This  principle  of  locality 
dominates  algorithm  design  at  all  levels. 

Multidimensional  Cooley-Tukey  algorithms,  and  their  vjiriants,  require  interproces¬ 
sor  communication  because  they  partition  the  data  set  at  every  stage  of  the  compu¬ 
tation.  At  a  minimum  this  necessitates  an  interprocessor  communication  requirement 
where  every  processing  element  must  exchange  data  with  every  other  processing  ele¬ 
ment  to  complete  the  calculation. 

This  rese2irch  approaches  the  problem  of  parallel  MD  DFT  computation  from  a 
new  perspective.  It  applies  a  reduction  rather  them  a  partitioning  algorithm  to  MD 
DFT  computation.  The  result  is  that  no  interprocessor  communication  is  required 
from  input  to  output.  This  eliminates  the  need  for  an  interconnection  network  for 
parallel  MD  DFT  computation.  These  networks  are  the  slowest,  most  costly,  complex 
and  energy  inefficient  elements  of  the  multiprocessor  system.  The  method  developed 
in  this  program  is  nearly  linear  in  speedup,  it  is  highly  scalable,  and  it  requires  no 
interprocessor  communication. 

The  method  is  based  on  a  hyperplane  eilgorithm  for  MD  DFT  computation.  The 
hyperplane  algorithm  is  introduced  in  chapter  2.  It  is  derived  by  restricting  the  d— di¬ 
mensional  DFT  to  fc— dimensional  hyperplemes  of  the  output  array.  The  algorithm  is 
developed  in  two  parts.  The  first  part  is  the  specification  of  a  minimal  set  of  fc— dimen¬ 
sional  hyperplanes  that  cover  a  d— dimensional  array.  These  are  given  by  theorems 
2.2.1  and  2.3.1  in  sections  2.2  and  2.3  respectively.  The  second  part  of  the  derivation 
is  the  restriction  of  the  d— dimensional  DFT  to  the  hyperplanes  of  a  covering  set.  The 
formulation  of  the  resulting  algorithm  is  given  by  equations  (2.20)  and  (2.21)  or  (2.34) 


51 


and  (2.35).  An  appendix  to  chapter  2  is  provided  that  enumerates  3D  and  4D  cases  of 
the  algorithm. 

A  broadcast  mode  multiprocessor  algorithm  based  on  a  direct  mapping  of  the  hy¬ 
perplane  algorithm  is  given  in  chapter  3.  The  main  advantage  of  this  algorithm  is  that 
it  eliminates  the  need  for  interprocessor  commtmication  in  MD  DFT  computation.  The 
cost  of  eliminating  the  communication  requirement  typical  of  parallel  MD  DFT  algo¬ 
rithms  is  that  one  addition  must  be  performed  at  each  processor  for  every  input  loaded 
into  the  machine.  The  algorithm  requires  a  machine  whose  degree  of  parallelism  equals 
the  number  of  dimensional  hyperplanes  required  to  cover  a  d— dimensional  array. 
That  number  is  given  by  theorems  2.2.1  and  2.3.1  of  chapter  2.  This  imposes  limita¬ 
tions  on  the  degree  of  parallelism  of  the  target  machine  which  are  circumvented  by  the 
alternative  mapping  of  chapter  4.  The  structure  of  the  direct  mapping  algorithm  is 
given  by  the  procedure  of  figure  3.2. 

Chapter  4  presents  an  alternative  multiprocessor  mapping  of  the  hyperplane  algo¬ 
rithm  that  requires  no  interprocessor  communication  and  is  also  highly  scalable.  The 
algorithm  is  derived  by  applying  hyperplane  and  multidimensional  Cooley-Tukey  meth¬ 
ods  together.  The  resulting  algorithm  uses  Cooley-Tukey  methods  to  lower  the  order 
of  the  MD  DFT  stages  that  the  hyperplane  algorithm  is  applied  to.  The  degree  of  par¬ 
allelism  of  the  algorithm  is  equal  to  the  number  of  hyperplanes  required  to  cover  one  of 
the  lower  order  MD  DFT  of  the  Cooley-Tukey  factorization.  Like  the  direct  mapping 
algorithm  the  hybrid  algorithm  requires  no  interprocessor  communication.  The  cost  of 
eliminating  interprocessor  communication  by  this  method  is  that  one  addition  must  be 
performed  at  each  processing  element  for  each  word  loaded  into  the  machine. 
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